All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog.
- Added a bounded scenario generator portfolio for reclaim, preempt, and consolidation search, with `SchedulingShard.spec.scenarioSearchBudgets` time-budget configuration and production scenario-search metrics.
- Added an opt-in `deviceaccess` admission plugin (`--block-nvidia-visible-devices`, config field `admission.blockNvidiaVisibleDevices`, default disabled) that (1) rejects pods overriding the `NVIDIA_VISIBLE_DEVICES` environment variable with values other than `void`/`none` (or via a `valueFrom` reference), and (2) injects `NVIDIA_VISIBLE_DEVICES=void` into containers that do not request a GPU, blocking their access to GPUs on the node.
- Added support for configuring admission Pod Disruption Budget via Helm values (`admission.podDisruptionBudget`) #1490 dttung2905
- Added an opt-in `hamicore` binder plugin (depends on `gpusharing`) to write the HAMI-core GPU memory limit (`CUDA_DEVICE_MEMORY_LIMIT`) for fractional GPU pods.
- Added `global.podSecurityContext`, `global.resourceReservation.namespaceLabels`, `nodescaleadjuster.labels`, `crdupgrader.resources`, `topologyMigration.resources`, and `postCleanup.resources` to the Helm chart.
- Skill to capture and run snapshots
- Added `kaiConfigDeployer.enabled` Helm value (default `true`) to allow disabling the post-install/post-upgrade hook that applies the kai-config CR, for managing the CR outside of the chart.
- Added `defaultShard.enabled` Helm value (default `true`) to allow installing KAI without deploying the chart-managed default `SchedulingShard` CR.
- Added NUMA-aware scheduling (v1). A new `numa` scheduler plugin predicts the kubelet Topology Manager's admission verdict for the `single-numa-node` and `restricted` policies from `NodeResourceTopology` (NRT) data, to prevent `TopologyAffinityError`s. The plugin is opt-in per shard (enable the `numa` plugin in a `SchedulingShard`). Shipped alongside it is an optional per-node NUMA placement exporter DaemonSet that uses the kubelet podresources API to export each pod's observed NUMA placement. This is v1: feasibility filtering and per-zone correctness; node scoring/optimization is future work. See the design for details: NUMA-Aware Scheduling via NodeResourceTopology.
- Scoped admission `runtimeClassName` injection to GPU fraction pods only; whole-GPU pods are no longer mutated. `admission.gpuPodRuntimeClassName` is deprecated in favor of `admission.gpuFractionRuntimeClassName`. Reservation pod `runtimeClassName` now defaults to empty. #1543 davidLif
- Removed redundant `PodDisruptionBudgetImplemented` guard from operator PDB creation helper #1613 dttung2905
- Updated Go toolchain and base build images to v1.26.3.
- Breaking: A JobSet now produces a single PodGroup with a two-level SubGroup hierarchy (one parent SubGroup per `replicatedJob`, one leaf SubGroup per replica) regardless of `startupPolicyOrder`. The `kai.scheduler/batch-min-member` annotation on the JobSet now overrides the root `minSubGroup`; the same annotation on `replicatedJobs[].template.metadata.annotations` overrides the leaf `minMember` (defaulting to `template.spec.parallelism`). #1617 davidLif
- Optimized the job solver to run the full allocation probe only once after partial search finds at least one solvable pending task.
- Reduced scheduler heap retention after scheduling cycles by clearing completed session snapshots and callback references, and by releasing the node scoring pool without waiting for finalizers.
- Fixed Helm chart prometheus RBAC being installed even when `prometheus.enabled` is false, and the `kai-prometheus` ClusterRoleBinding referencing the `prometheus` ServiceAccount in the hardcoded `kai-scheduler` namespace instead of the Helm release namespace #1684 dttung2905
- Fixed post-delete cleanup hook hardcoding the `kai-scheduler` namespace instead of the Helm release namespace on `helm uninstall` #1619 dttung2905
- Improved solver performance in some large reclaim scenarios #1627 itsomri
- Grove grouper now sets `minSubGroup` (equal to the number of child SubGroups) instead of `minMember=0` on parent SubGroups generated from `topologyConstraintGroupConfigs` #1639 davidLif
- Fixed Helm chart not wiring `podgrouper.queueLabelKey` into `spec.global.queueLabelKey` on the Config CR, so custom queue label keys were ignored at install time #1655 dttung2905
- Fixed scheduler nil-pointer panic in the preempt scenario builder when a (partial) job has no tasks to allocate (`NewIdleGpusFilter` dereferenced a nil scenario); added the missing nil-guard matching the sibling filters #1664 sam-huang1223
- Fixed default node-scale-adjuster image name (`node-scale-adjuster` → `nodescaleadjuster`) so it matches the image published to GHCR
- Fixed duplicate GPU reservation pods being created for a single `gpu-group` on a node (each reserving a different physical GPU), which corrupted the scheduler's fractional-GPU accounting and left devices unschedulable. Reservation pods are now named deterministically per (node, gpu-group) and treat AlreadyExists as success, so concurrent or retried binds collide on one object instead of duplicating #1673
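
The deterministic-naming fix above can be illustrated with a minimal sketch. It is not the binder's code: `FakePodStore`, `reservation_pod_name`, and the `gpu-reservation-` prefix are hypothetical names, and the in-memory store stands in for the API server. It only shows why a name derived from (node, gpu-group) makes concurrent or retried creates collide on one object.

```python
import hashlib


class AlreadyExists(Exception):
    """Raised by the fake store when an object with the same name exists."""


class FakePodStore:
    """In-memory stand-in for the API server (assumption for illustration)."""

    def __init__(self):
        self.pods = {}

    def create(self, name, spec):
        if name in self.pods:
            raise AlreadyExists(name)
        self.pods[name] = spec


def reservation_pod_name(node, gpu_group):
    # Hypothetical naming scheme: the same (node, gpu-group) pair always
    # yields the same name, so a second create cannot add a second pod.
    digest = hashlib.sha256(f"{node}/{gpu_group}".encode()).hexdigest()[:10]
    return f"gpu-reservation-{digest}"


def ensure_reservation_pod(store, node, gpu_group):
    name = reservation_pod_name(node, gpu_group)
    try:
        store.create(name, {"node": node, "gpu_group": gpu_group})
    except AlreadyExists:
        # Another bind (or a retry) already created it: treat as success.
        pass
    return name
```

Calling `ensure_reservation_pod` twice with the same arguments returns the same name and leaves exactly one entry in the store.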
- Added `enabled` Helm values for `binder`, `podgrouper`, `podgroupcontroller`, `queuecontroller`, `admission`, and `scheduler` to allow disabling individual components from values.yaml. Previously these were hardcoded to `true` in the kai-config template.
- Added `prometheus.enabled` and `prometheus.externalPrometheusUrl` Helm values to configure Prometheus from values.yaml #907
- Added validation for `subgroup` name in podgroup faizanexe
- Added memory profile and run duration to snapshot tool #1411
- Added support for configuring pod and container security contexts on resource reservation pods via CLI flags AdheipSingh
- Added `operator.logLevel` Helm value to configure the operator log level (maps to `--zap-log-level` when set) #1446 dttung2905
- The scheduler now implements elastic PodGroups on both the subgroup level (`minSubGroup`) and pods (`minAvailable`). This allows for elasticity on all of the podgroup tree hierarchy. #1416 - davidLif
- Allow the configuration of plugins in the binder service. #1480 - davidLif
- Added support for configuring scheduler log level and custom scheduler args via Helm values (`scheduler.args`) #1452 dttung2905
- Added `global.jsonLog` Helm value to enable JSON-formatted logging for use with log aggregation platforms
- Added `crdupgrader.image.registry` Helm value to override `global.registry` for the `crd-upgrader` pre-install/pre-upgrade hook image, allowing the hook image to be served from a separate mirror without redirecting all chart images. #1404
- Added `queue_metadata_name` and `queue_display_name` labels to all queue metrics emitted by both the scheduler (`queue_fair_share_*`, `queue_*_usage`) and the queue-controller (`queue_info`, `queue_deserved_gpus`, `queue_quota_*`, `queue_allocated_*`). `queue_metadata_name` always carries the Queue's `metadata.name` and is the recommended join key between scheduler and queue-controller metrics; `queue_display_name` carries `spec.displayName` (empty when unset). The legacy `queue_name` label is preserved unchanged to keep existing dashboards working. #1566
- Added support for externally-created PodGroups. Workloads can opt out of podgrouper mutation with `kai.scheduler/skip-podgrouper: "true"` on the pod or owner chain, join an existing PodGroup via `pod-group-name`, and now get a pod condition when they reference a non-existent subgroup. #1420
- Added `--stuck-in-releasing-threshold` scheduler flag (default `2m`) controlling how long a Running pod with a `deletionTimestamp` remains classified as `Releasing` before being reclassified as `StuckInReleasing` and excluded from pipelining. Configurable per shard via `SchedulingShard.spec.args.stuck-in-releasing-threshold`.
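
As a rough sketch of the threshold rule described for `--stuck-in-releasing-threshold`: the function name and the string results below are illustrative, and whether the real comparison is strict at the boundary is an assumption, not something the entry states.

```python
from datetime import datetime, timedelta, timezone


def classify_running_pod(deletion_timestamp, now, threshold=timedelta(minutes=2)):
    """Classify a Running pod by how long it has been marked for deletion.

    Hypothetical helper: only the 2m default and the status names come
    from the changelog entry; the rest is an illustration.
    """
    if deletion_timestamp is None:
        return "Running"
    # Assumption: a pod exactly at the threshold still counts as Releasing.
    if now - deletion_timestamp > threshold:
        return "StuckInReleasing"
    return "Releasing"


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = classify_running_pod(now - timedelta(seconds=30), now)
stale = classify_running_pod(now - timedelta(minutes=5), now)
```

Here `recent` stays `Releasing`, while `stale` has exceeded the default threshold and would be excluded from pipelining.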
- Breaking: JobSet PodGroups no longer auto-calculate `minAvailable` from `parallelism × replicas`. The default is now 1. Use the `kai.scheduler/batch-min-member` annotation to set a custom value.
- Bumped `k8s.io/*` module group from v0.34.x to v0.35.4, `k8s.io/kubernetes` to v1.35.4, and `sigs.k8s.io/controller-runtime` to v0.23.3, enabling KEP-4671 Workload API types. #1466
- Rebuilt the `crd-upgrader` hook image on `alpine:3.20` instead of `ubi9/ubi-minimal`. Image size drops from ~165 MB to ~67 MB uncompressed (~60% reduction), shrinking cold-pull latency on ephemeral CI runners. The image is also reused by the `topology-migration` and `post-delete` hook jobs as a generic `kubectl + bash` toolbox, so bash is preserved on the runtime image. #1404
- Account for native sidecar containers (initContainers with `restartPolicy: Always`, KEP-753) in pod resource accounting, matching kubelet's `AggregateContainerRequests`. Previously, native sidecar requests were max'd against regular containers instead of summed with them, causing the scheduler to bind pods that kubelet then rejected at admission with `OutOfCpu`/`OutOfGpu`. #1556
- Snapshot JSON is now streamed directly into the zip writer to avoid OOM on large clusters. The `/get-snapshot` endpoint previously buffered the entire JSON payload in memory (~3x the data size); it now streams per-element, reducing peak memory to ~1x. #1564
- Fixed `additionalImagePullSecrets` in Config CR rendering as `map[name:...]` instead of plain strings by extracting `.name` from `global.imagePullSecrets` objects. Also propagated `global.imagePullSecrets` to all Helm hook jobs (`crd-upgrader`, `topology-migration`, `post-delete-cleanup`)
- Added `global.nodeSelector`, `global.tolerations`, `global.affinity`, `global.securityContext` support to the post-delete job hook.
- Fixed Helm template writing `imagesPullSecret` (string) instead of `additionalImagePullSecrets` (array) in Config CR, causing image pull secrets to be silently ignored. Added backward-compatible deprecated `imagesPullSecret` field to CRD schema. #942
- Fixed `windowSize` field in `SchedulingShard` CR to support Prometheus duration format (e.g. `1w`, `7d`). Previously, using `windowSize: 1w` as shown in the documentation caused the kai-operator to crash-loop with `time: unknown unit "w" in duration "1w"`.
- Fixed a race condition where `SyncForGpuGroup` could prematurely delete reservation pods when the informer cache had not yet propagated GPU group labels on recently-bound fraction pods. The binder now checks for active BindRequests referencing the GPU group before deleting a reservation pod.
- Fixed non-preemptible multi-device GPU memory jobs being allowed to exceed their queue's deserved GPU quota. The per-node quota check now correctly accounts for all requested GPU devices. #1369
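
The native sidecar accounting entry above (#1556) contrasts summing sidecar requests with taking their maximum. The sketch below illustrates that difference for a single resource, following the aggregation rule described in KEP-753 as I understand it. It is a simplified illustration, not the scheduler's or kubelet's code; `InitContainer`, `effective_request`, and `legacy_request` are hypothetical names, and pod overhead is ignored.

```python
from dataclasses import dataclass


@dataclass
class InitContainer:
    request: int               # e.g. CPU in millicores
    restartable: bool = False  # True models restartPolicy: Always (native sidecar)


def effective_request(app_requests, init_containers):
    """Pod-level request when sidecars are summed with regular containers."""
    total = sum(app_requests)  # steady state: all regular containers run together
    sidecars_so_far = 0        # sidecars already started keep running
    init_peak = 0              # largest demand seen during initialization
    for c in init_containers:
        if c.restartable:
            total += c.request
            sidecars_so_far += c.request
            needed = sidecars_so_far
        else:
            # A regular init container runs next to the sidecars started before it.
            needed = c.request + sidecars_so_far
        init_peak = max(init_peak, needed)
    return max(total, init_peak)


def legacy_request(app_requests, init_containers):
    """Old behaviour as described in the entry: sidecars are only max'd."""
    return max(sum(app_requests), max((c.request for c in init_containers), default=0))


inits = [InitContainer(300), InitContainer(200, restartable=True)]
summed = effective_request([500], inits)
maxed = legacy_request([500], inits)
```

With a 500m regular container, a 300m init container, and a 200m sidecar, `summed` exceeds `maxed` by the sidecar's request, which is the gap that previously let the scheduler bind pods that kubelet would reject.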
- Added `resourceclaims/binding` RBAC permission to the binder ClusterRole for compatibility with Kubernetes v1.36+, where the `DRAResourceClaimGranularStatusAuthorization` feature gate requires explicit permission on the `resourceclaims/binding` subresource to modify `status.allocation` and `status.reservedFor` on ResourceClaims. #1372 praveen0raj
- Allow users to override minMember for k8s batch Jobs and JobSets using the `kai.scheduler/batch-min-member` annotation #1308 itsomri
- Fixed a bug where nil minMember caused subgroups creation to fail in scheduler #1407 itsomri
- Improved performance by evaluating SetNode once per session instead of on each predicate evaluation #1421 itsomri
- Added persistent volumes to cluster snapshot #1424 itsomri
- Improved scheduling performance for preempt/reclaim/consolidate actions on jobs with many tasks by replacing per-task linear probing with exponential+binary search in the job solver, reducing the number of scenario simulations from O(n) to O(log n) #1435 itsomri
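
To make the O(n) to O(log n) claim in the entry above concrete, here is a generic sketch of exponential search followed by binary search. It is not the job solver's code: `max_feasible` and the `feasible` callback are hypothetical, and it assumes feasibility is monotone (if `k` tasks can be placed, any smaller count can too).

```python
def max_feasible(n, feasible):
    """Return the largest k in [0, n] for which feasible(k) is True.

    Assumes monotone feasibility; each feasible() call stands in for one
    scenario simulation.
    """
    if n == 0 or not feasible(1):
        return 0
    lo, hi = 1, 2  # lo is always known to be feasible
    # Exponential phase: double the probe until it fails or passes n.
    while hi <= n and feasible(hi):
        lo = hi
        hi *= 2
    hi = min(hi, n + 1)  # hi is now infeasible or just past the end
    # Binary phase: narrow the gap between feasible lo and infeasible hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo


probes = []

def can_place(k):
    probes.append(k)
    return k <= 37  # toy oracle: at most 37 tasks fit

best = max_feasible(1000, can_place)
```

In this toy run the search probes a little over a dozen task counts instead of walking the candidates one by one.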
- Avoid expensive solver-backed reclaim/preempt/consolidation work for jobs already blocked by victim-invariant pre-solver failures such as missing PVCs, missing required ConfigMaps, or requests larger than the maximum node size. #1502
- Fixed `skipTopOwnerGrouper` not propagating per-type defaults (priority class and preemptibility) for skipped owners (e.g. `DynamoGraphDeployment`), causing PodGroup spec to retain stale values after defaults ConfigMap updates.
- Fixed binder DRA detection on clusters where the upstream `DynamicResourceAllocation` feature gate does not reflect server-side DRA availability. The binder now probes the API server during init (matching the scheduler) so the DRA plugin is gated on the same authoritative decision. #1481
- Suppressed noisy `Reconciler error` logs and `PodGrouperWarning` events on transient PodGroup update conflicts. The podgrouper now treats `IsConflict` errors as expected and silently requeues the reconcile instead of surfacing the apiserver's "object has been modified" message.
- Stopped recreating the `kai-config` CR on every `helm upgrade`. The CR is now applied by a post-install/post-upgrade hook Job (`kai-config-deployer`) using `kubectl apply --server-side` instead of being a Helm-managed resource, so its UID stays stable across upgrades. Previously, the default `before-hook-creation` policy deleted and recreated `kai-config` on every upgrade, cascading via `ownerReferences` to all operand `ServiceAccounts` (including `scheduler`). When an upgrade did not change the scheduler Deployment pod template, scheduler pods kept their old projected tokens (bound to the now-deleted SA UID) and failed every API call with `401 Unauthorized` until kubelet rotated the token at ~80% TTL. A matching post-delete Job removes the CR on `helm uninstall`. #1536
- Fixed kai-operator not reconciling on Prometheus and ServiceMonitor changes. The Config controller now watches owned `Prometheus` and `ServiceMonitor` resources, so deletions and drift trigger reconciliation. CRD presence is checked at startup against the API server (the scheme-only check used previously could not detect missing CRDs), and the watch is registered only when the CRDs are installed. #877
- Added `before-hook-creation` to the `crd-upgrader` Helm hook delete policy so failed hook Jobs no longer block subsequent `helm upgrade --install` retries. Aligns with the policy already used by the chart's other hook resources. #1404
- Fixed kai-operator leader-election event emission by adding RBAC permission for core `events` (`create`, `patch`, `update`) so operators can publish leadership events instead of logging `events is forbidden`. #1572 dttung2905
- The scheduler's per-shard Service is now populated by an operator-managed `EndpointSlice` pointing at the current leader-election Lease holder, which is connected to the service of the shard's scheduler. This allows the service to route all its incoming requests to the lease-holding pod of the scheduler deployment. #1593 davidLif
- Fixed `podgroupcontroller` logging spurious errors on every reconcile for completed/failed pods because it tried to fetch DRA `ResourceClaim` objects that the DRA driver had already deleted. Terminal pods now skip the ResourceClaim lookup entirely, mirroring the scheduler-side fix in #1456. #1529
- Added queue validation webhook to queuecontroller with optional quota validation for parent-child relationships AdheipSingh
- Added support for VPA configuration for the different components of the KAI Scheduler - jrosenboimnvidia
- Users that have VPA installed on their cluster can now utilize it for proper vertical autoscaling
- Added FOSSA scanning for the repository context. Scans will also be performed for submitted PRs. The results can be found here. #1178 - davidLif
- Added support for Ray subgroup topology-aware scheduling by specifying `kai.scheduler/topology`, `kai.scheduler/topology-required-placement`, and `kai.scheduler/topology-preferred-placement` annotations.
- Allow subgroups to have a 0 value for "minAvailable". This means that all pods in this subgroup are "elastic extra pods". #1216 davidLif
- Added a display web page for Scale test results for public viewing #1154 SiorMeir
- Auto-enable leader election when `operator.replicaCount` > 1 to prevent concurrent reconciliation #1218
- Updated Go version to v1.26.1, with appropriate upgrades to the base Docker images, linter, and controller generator. #1222 - davidLif
- Updated resource enumeration logic to exclude resources with count of 0. #1120
- Fixed scheduler on k8s < 1.34 with DRA disabled.
- Fixed pod group controller failing to track DRA GPU resources on Kubernetes 1.32-1.33 clusters. #1214
- Fixed scheduling-constraints signature hashing for `Priority` and container `HostPort` by encoding full `int32` values, preventing byte-truncation collisions and flaky signature tests.
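
The byte-truncation collision mentioned in the signature hashing entry above can be reproduced with a small sketch. The hash function (SHA-256), the little-endian byte order, and the function names are assumptions for illustration; the entry does not say which hash or encoding the scheduler uses.

```python
import hashlib
import struct


def truncated_signature(values):
    """Buggy variant: only the low byte of each value reaches the hash."""
    h = hashlib.sha256()
    for v in values:
        h.update(bytes([v & 0xFF]))
    return h.hexdigest()


def full_signature(values):
    """Fixed variant: each value is encoded as a full 4-byte int32."""
    h = hashlib.sha256()
    for v in values:
        h.update(struct.pack("<i", v))  # little-endian is an arbitrary choice here
    return h.hexdigest()


# Two host ports that differ by 256 share the same low byte.
collides = truncated_signature([8080]) == truncated_signature([8336])
distinct = full_signature([8080]) != full_signature([8336])
```

Any two values that differ by a multiple of 256 collide under the truncated encoding, so distinct pods could be given the same signature.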
- Fixed a potential state corruption in DRA scheduling simulations #1219 itsomri
- Fixed operator reconcile loop caused by status-only updates triggering re-reconciliation. #1229 cypres
- Fixed scheduler not starting on k8s clusters with DRA disabled, due to the ResourceSliceTracker not syncing. #1241 cypres
- Fixed webhook reconcile loop on AKS, by retaining the cloud-provider-injected namespaceSelector rules during reconciliation. #1292 cypres
- Added `minSubGroup` field to PodGroup and SubGroup API to support specifying the minimum number of child SubGroups required for elastic gang scheduling, along with validation to prevent simultaneous use of `minSubGroup` and `minMember` fields (#TBD) by KAI Dev Agent
- Added `global.nodeSelector` propagation from Helm values to Config CR, ensuring operator-created sub-component deployments (admission, binder, scheduler, pod-grouper, etc.) receive the configured nodeSelector #1102 yuanchen8911
- Added `plugins` and `actions` fields to SchedulingShard spec, allowing per-shard customization of scheduler plugin/action enablement, priority, and arguments gshaibi
- Added support for Kubeflow Trainer v2 TrainJob workloads via skipTopOwner grouper pattern
- Added `binder.cdiEnabled` Helm value to allow explicit override of CDI auto-detection for environments without ClusterPolicy
- Added metric for tracking evicted pods in pod groups, including nodepool, eviction action, and gang size
- Block scheduling of pods with shared (non-template) DRA GPU claims that lack a queue label or have a mismatched queue label gshaibi
- Added the option to disable prometheus service monitor creation #810 itsomri
- Fixed prometheus instance deprecation - ensure single instance #779 itsomri
- Added clear error messages for jobs referencing missing or orphan queues, reporting via events and conditions #820 gshaibi
- Added rule selector for resource accounting prometheus #818 itsomri
- Made accounting labels configurable #818 itsomri
- Added support for Grove hierarchical topology constraints in PodGroup subgroups
- Added support for n-level queue hierarchies #858 gshaibi
- Added labels and annotations propagation from topOwner in SkipTopOwner grouper #861 SiorMeir
- Added scheduler name match conditions to admission webhooks to improve cluster stability
- Added GPU DRA claims and resource slices accounting for the purpose of resource management and quota guarantees. *** This change doesn't support shared GPU claims or GPU claims with FirstAvailable *** #900 davidLif
- Added DRA resources recording to snapshot #830
- Temporarily prevent device-plugin GPU pods on DRA-only nodes until translation between device-plugin notation and DRA is implemented
- Implemented subgroups for pytorchjobs #935 itsomri
- Made KAI images distroless #745 dttung2905
- Allow setting empty gpuPodRuntimeClassName during helm install #972 steved
- Created scale tests scenarios for running scale tests for KAI #967
- Implemented block-level segmentation for pytorchjobs #938 itsomri
- Added scale test environment setup script and updated service monitors for KAI scheduler #1031
- Implemented subgroups for leaderworkerset #1046 davidLif
- Added discovery data to snapshot for more accurate debugging #1047 itsomri
- Implemented subgroup segmentation (with topology segment definitions) for leaderworkerset #1058 davidLif
- Fixed operator status conditions to be kstatus-compatible for Helm 4 `--wait` support: added `Ready` condition and fixed `Reconciling` condition to properly transition to false after reconciliation completes #1060
- Fixed a bug where the node scale adjuster would not check if a pod was unschedulable before creating a scaling pod, leading to unnecessary node scaling #1094 slaupster
- Fixed admission webhook to skip runtimeClassName injection when gpuPodRuntimeClassName is empty #1035
- Fixed topology-migration helm hook failing on OpenShift due to missing `kai-topology-migration` service account in the `kai-system` SCC #1050
- Fixed a bug where queue status did not reflect its podgroups resources correctly #1049
- Fixed `helm uninstall` not removing webhooks #959 faizan-exe
- Fixed security vulnerability where PodGang could reference pods in other namespaces, preventing cross-namespace manipulation
- Fixed pod controller logging to use request namespace/name instead of empty pod object fields when pod is not found
- Fixed a bug where topology constraints with equal required and preferred levels would cause the preferred level not to be found.
- Fixed GPU memory pods Fair Share and Queue Order calculations
- Interpret negative or zero half-life value as disabled #818 itsomri
- Handle invalid CSI StorageCapacities gracefully #817 rich7420
- Embed CRD definitions in binary for env-test and time-aware-simulations to allow binary portability #818 itsomri
- Fixed missing `podGrouper` configuration in Helm template that prevented podgrouper values from being applied #860
- Fixed rollback for failed bind attempts #847 itsomri
- Fixed missing `namespace`, `serviceAccountName`, and `appLabel` fields in `resourceReservation` section of kai-config Helm template #860 dttung2905
- If a preferred topology constraint is set, do not try to find a lowest common subtree (as a part of the calculations optimizations) which is lower than the preferred level
- Added dedicated `usage-prometheus` service for scheduler Prometheus access with configurable instance name #896 itsomri
- ClusterPolicy CDI parsing for gpu-operator > v25.10.0
- Fixed missing `repository`, `tag`, and `pullPolicy` fields in `resourceReservationImage` section of kai-config Helm template #895 dttung2905
- Fixed a bug in ray gang scheduling where not all worker groups' minMember would be respected #924 itsomri
- cpu-only nodes calculation in DRA enabled clusters #944
- enable DRA flag override fix in snapshot-tool #955
- Fixed ConfigMap predicate to respect the Optional field and now considers ConfigMaps in projected volumes and ephemeral containers
- Fixed simulations that failed due to pod capacity on node #969 itsomri
- Fixed a bug where some resource claims would remain marked as bound to devices forever
- Removed the constraint that prohibited direct nesting of subgroups alongside podsets within the same subgroupset.
- Fixed plugin server (snapshot and job-order endpoints) listening on all interfaces by binding to localhost only.
- Removed redundant `connection` field from `GlobalConfig` in favor of `Prometheus.ExternalPrometheusUrl` for external Prometheus URL configuration
- Introduced native KAI Topology CRD to replace dependency on Kueue's Topology CRD, improving compatibility and simplifying installation
- Added support for having the default "preemptibility" per top-owner-type read from the default configs configmap in the pod-grouper
- Added option to profile CPU when running the snapshot tool #726 itsomri
- GPU resource bookkeeping for DRA enabled resources
- Added a "tumbling window" usage configuration: the tumbling window size is calculated from a start time configuration and a duration config field.
- Added an option to disable prometheus persistency #764 itsomri
- If enabled, prometheus storage size is not inferred from cluster objects, but defaults to 50Gi unless explicitly set in KAI config #756 itsomri
- When prometheus is disabled, it will remain in the cluster for a grace period equal to its retention, unless re-enabled #756 itsomri
- Fixed a bug where the snapshot tool would not load topology objects #720 itsomri
- Operator to conditionally watch ClusterPolicy based on its existence, preventing errors in its absence
- Fixed confusing resource division log message #733 itsomri
- Made post-delete-cleanup resources configurable #737 dttung2905
- Fixed GPU memory pods not being reclaimed or consolidated correctly
- Added missing leases permission for the operator #753 dttung2905
- Fixed reclaim/preempt/consolidate actions for topology workloads #739 itsomri
- Fixed a bug where the scheduler would not consider topology constraints when calculating the scheduling constraints signature #761 gshaibi
- Fixed Dynamo integration by adding Dynamo GVKs to SkipTopOwner table
- Keep creating service monitors for deprecated prometheus instances #774 itsomri
- Fix retention duration parsing for deprecated prometheus instances #774 itsomri
- Renamed the previous "tumbling" option for the scheduler usage window type to "cron".
- Removed the requirement to specify container type for init container gpu fractions #684 itsomri
- When a status update for a podGroup in the scheduler is flushed due to update conflict, delete the update payload data as well #691 davidLif
- Fixed scheduler pod group status update conflict #676 davidLif
- Fixed gpu request validations for pods #660 itsomri
- Dependabot configuration to update actions in workflows #651 ScottBrenner
- optimize dependency management by using module cache instead of vendor directory #645 lokielse
- Added parent reference to SubGroup struct in PodGroup CRD to create a hierarchical SubGroup structure
- Added the option to configure the names of the webhook configuration resources.
- Option to configure reservation pods runtime class.
- Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
- Added enforcement of the `nvidia` runtime class for GPU pods, with the option to enforce a custom runtime class, or disable enforcement entirely.
- Added a preferred podAntiAffinity term by default for all services, can be set to required instead by setting `global.requireDefaultPodAffinityTerm`
- Added support for service-level affinities
- Added time aware scheduling capabilities
- Added option to specify container name and type for fraction containers
- (OpenShift only) - High CPU usage for the operator pod due to continuous reconciles
- Fixed a bug where the scheduler would not re-try updating podgroup status after failure
- Fixed a bug where ray workloads gang scheduling would ignore `minReplicas` if autoscaling was not set
- KAI Config wrong statuses when prometheus operand is enabled
- GPU-Operator v25.10.0 support for CDI enabled environments
- Option to configure reservation pods runtime class.
- Fixed Helm chart compatibility with Helm 4 wait logic to prevent indefinite hangs during deployment readiness checks
- Added the option of providing the podgrouper app a scheme object to use
- config.kai.scheduler CRD that will describe the installation of all KAI-scheduler services for the operator
- Initial KAI-operator implementation for managing components
- PodGroup Controller, Queue Controller, Admission and Scale Adjuster operands to operator lifecycle management
- Deployment of operator in Helm chart alongside pod group controller
- Deploy PodGroup Controller, Queue Controller, Admission and Scale Adjuster via operator for streamlined deployment
- schedulingshards.kai.scheduler CRD that describes partitioning the cluster nodes for different scheduling options.
- Moved the CRDs into the helm chart so that they are also installed by helm and not only by the crd-upgrader, but removed the external kueue clone of topology CRD from being automatically installed.
- Updated queue controller image name to align with current deployment standards
- Removed webhook manager component as part of operator-based refactoring
- Added configurable plugins hub for podgrouper using interface and RegisterPlugins
- Added a plugin to reflect joborder in scheduler http endpoint - Contributed by Saurabh Kumar Singh singh1203.ss@gmail.com
- Fixed a bug where workload with subgroups would not consider additional tasks above minAvailable
- Removed unused code that required gpu-operator as a dependency
- Fixed wrong GPU memory unit conversion from node `nvidia.com/gpu.memory` labels
- Fixed incorrect MIG GPU usage calculation leading to wrong scheduling decision
- Added a new scheduler flag `--update-pod-eviction-condition`. When enabled, a DisruptionTarget condition is set on the pod before deletion
- Fixed scheduler panic in some elastic reclaim scenarios
- Added leader election configuration in all deployments and added global helm value that controls it during installation
- Separated admission webhooks from binder service to a separate `kai-admission` service
- crd-upgrader respects global values for nodeSelector, affinity and tolerations
- kai-scheduler will not ignore pod spec.overhead field
- Fixed container env var overwrite to cover possible cases where env var with Value is replaced with ValueFrom or the other way
- Fixed a scenario where only GPU resources were checked for job and node, causing it to be bound instead of being pipelined
- Added GPU_PORTION env var for GPU sharing pods
- Fixed a miscalculation where cpu/memory releasing resources were considered idle when requesting GPU fraction/memory
- Changed RUNAI-VISIBLE-DEVICES key in GPU sharing configmap to NVIDIA_VISIBLE_DEVICES
- Removed GPU sharing configmap name resolution from env vars and volumes
- Added LeaderWorkerSet support in the podGrouper. Each replica will be given a separate podGroup.
- Added kueue topology CRD to kai installations
- Fixed cases where reclaim validation operated on outdated info, allowing invalid reclaim scenarios
- Added optional pod and namespace label selectors to limit the scope of monitored pods
- Added a plugin extension point for scheduler plugins to add annotations to BindRequests
- Added support for Grove
- Changed `run.ai/top-owner-metadata` to `kai.scheduler/top-owner-metadata`
- Changed `runai-reservation` namespace to `kai-resource-reservation`. For migration guide, refer to this doc
- Changed `runai/queue` label key to `kai.scheduler/queue`. For migration guide, refer to doc
- Fixed pod status scheduled race condition between the scheduler and the pod binding
- Removed redundant `replicas` key for binder from `values.yaml` as it is not used and not supported
- Removed `runai-job-id` and `runai/job-id` annotations from pods and podgroups
- Added minruntime plugin, allowing PodGroups to run for a configurable amount of time without being reclaimed/preempted.
- PodGroup Controller that will update podgroups statuses with allocation data.
- Queue Controller that will update queues statuses with allocation data.
- Added support for k8s pod scheduling gates
- nodeSelector, affinity and tolerations configurable with global value definitions
- Added `PreemptMinRuntime` and `ReclaimMinRuntime` properties to queue CRD
- Scheduler now adds a "LastStartTimestamp" to podgroup on allocation
- Queue order function now takes into account potential victims, resulting in better reclaim scenarios.
- Fixed preempt/reclaim of elastic workloads only taking one pod.
- Scheduler now doesn't label pods' nodepool when nodepool label value is empty