Releases: NVIDIA/KAI-Scheduler

v0.12.0

24 Dec 14:41
c2e76fb

Note: v0.12 is a stable release as described in SUPPORT.md

What's Changed

Added

  • Introduced native KAI Topology CRD to replace dependency on Kueue's Topology CRD, improving compatibility and simplifying installation
  • Added support in the pod-grouper for reading the default "preemptibility" per top-owner type from the default configs ConfigMap
  • Added option to profile CPU when running the snapshot tool #726 itsomri
  • Added GPU resource bookkeeping for DRA-enabled resources
  • Added a "tumbling window" usage configuration: the tumbling window size is calculated from a start time configuration and a duration config field.
  • Added an option to disable Prometheus persistence #764 itsomri
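
The tumbling window item above can be illustrated with a small calculation. This is a minimal sketch, not the scheduler's implementation: it assumes that windows are fixed-size, non-overlapping, and aligned to the configured start time. The function name `current_tumbling_window` and its parameters are hypothetical and do not correspond to KAI config field names.

```python
from datetime import datetime, timedelta, timezone


def current_tumbling_window(start: datetime, duration: timedelta, now: datetime):
    """Return (window_start, window_end) of the tumbling window containing `now`.

    Assumption: windows are back-to-back intervals of length `duration`,
    with the first window beginning at `start`.
    """
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    if now < start:
        raise ValueError("now is before the configured start time")
    # Number of complete windows that have elapsed since the start time.
    index = (now - start) // duration
    window_start = start + index * duration
    return window_start, window_start + duration


# Hypothetical values: weekly windows starting on 1 January 2025 (UTC).
start = datetime(2025, 1, 1, tzinfo=timezone.utc)
window = current_tumbling_window(
    start, timedelta(days=7), datetime(2025, 1, 17, 12, tzinfo=timezone.utc)
)
```

With these illustrative values, two full weeks have elapsed, so the current window runs from 15 January to 22 January.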

Changed

  • When Prometheus is enabled, its storage size is not inferred from cluster objects; it defaults to 50Gi unless explicitly set in the KAI config #756 itsomri
  • When Prometheus is disabled, it remains in the cluster for a grace period equal to its retention, unless it is re-enabled #756 itsomri
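
The two Prometheus changes above can be read as a pair of simple rules. The sketch below restates them under stated assumptions; it is not the operator's code. The helper names (`resolve_storage_size`, `should_remove`) are hypothetical, and it assumes the grace period starts when Prometheus is disabled.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_STORAGE_SIZE = "50Gi"  # default stated in the release notes


def resolve_storage_size(configured: Optional[str]) -> str:
    """Use the explicitly configured size if set, otherwise the default."""
    return configured if configured else DEFAULT_STORAGE_SIZE


def should_remove(disabled_at: datetime, retention: timedelta,
                  now: datetime, re_enabled: bool) -> bool:
    """Assumed rule: remove only after a grace period equal to the retention,
    and never if Prometheus has been re-enabled in the meantime."""
    if re_enabled:
        return False
    return now >= disabled_at + retention


# Hypothetical values: disabled on 1 December 2025 with a 14-day retention.
disabled_at = datetime(2025, 12, 1, tzinfo=timezone.utc)
retention = timedelta(days=14)
```

Under these assumptions, an instance disabled on 1 December with a 14-day retention would stay in the cluster until 15 December unless it is re-enabled earlier.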

Fixed

  • Fixed a bug where the snapshot tool would not load topology objects #720 itsomri
  • The operator now watches ClusterPolicy only if it exists, preventing errors in its absence
  • Fixed confusing resource division log message #733 itsomri
  • Made post-delete-cleanup resources configurable #737 dttung2905
  • Fixed a bug where GPU memory pods were not reclaimed or consolidated correctly
  • Added missing leases permission for the operator #753 dttung2905
  • Fixed reclaim/preempt/consolidate actions for topology workloads #739 itsomri
  • Fixed a bug where the scheduler would not consider topology constraints when calculating the scheduling constraints signature #761 gshaibi
  • Fixed Dynamo integration by adding Dynamo GVKs to SkipTopOwner table
  • Keep creating service monitors for deprecated prometheus instances #774 itsomri
  • Fix retention duration parsing for deprecated prometheus instances #774 itsomri

Changed

  • Renamed the previous "tumbling" option for the scheduler usage window type to "cron".

Full Changelog: v0.10.2...v0.12.0

v0.10.5

24 Dec 10:21
cf0617b

What's Changed

Fixed

  • Keep creating service monitors for deprecated prometheus instances #775 itsomri
  • Fix retention duration parsing for deprecated prometheus instances #775 itsomri
  • Fixed Dynamo integration by adding Dynamo GVKs to SkipTopOwner table #757

Full Changelog: v0.10.4...v0.10.5

v0.10.4

23 Dec 10:18
461e81b

What's Changed

Added

  • Added a "tumbling window" usage configuration: the tumbling window size is calculated from a start time configuration and a duration config field.
  • Added an option to disable Prometheus persistence #765 itsomri

Changed

  • When Prometheus is enabled, its storage size is not inferred from cluster objects; it defaults to 50Gi unless explicitly set in the KAI config #765 itsomri
  • When Prometheus is disabled, it remains in the cluster for a grace period equal to its retention, unless it is re-enabled #765 itsomri

Fixed

  • Fixed reclaim/preempt/consolidate actions for topology workloads #748 itsomri

Full Changelog: v0.10.3...v0.10.4

v0.10.3

09 Dec 10:35
6f8ecfa

What's Changed

  • chore(ci): move Docker data and image cache to /mnt for more disk space by @gshaibi in #718
  • feat(ci): changelog validation by @gshaibi in #722
  • test: check flakey test by @github-actions[bot] in #730
  • Resource division log usage v0.10 by @itsomri in #734
  • feat: default preemtibility from configmap by @natasharomm in #736

Full Changelog: v0.10.2...v0.10.3

v0.9.9

07 Dec 22:34
114e59f

What's Changed

  • chore(ci): move Docker data and image cache to /mnt for more disk space by @gshaibi in #719
  • feat(ci): changelog validation by @gshaibi in #723
  • fix(api,operator): make KaiConfig conditions compatible with helm 4 wait logic by @gshaibi in #710

Full Changelog: v0.9.8...v0.9.9

v0.10.2

25 Nov 14:41
97cf9c2

What's Changed

Fixed

  • Removed the requirement to specify container type for init container gpu fractions #684 itsomri
  • When a status update for a podGroup in the scheduler is flushed due to an update conflict, the update payload data is now deleted as well #691 davidLif
  • Fixed scheduling shard cleanup on helm uninstall #678 srujanreddya

Full Changelog: v0.10.1...v0.10.2

v0.10.1

23 Nov 11:05
883b1f9

What's Changed

Fixed

  • Fixed scheduler pod group status update conflict #676 davidLif
  • Fixed gpu request validations for pods #660 itsomri

Changed

  • Dependabot configuration to update actions in workflows #651 ScottBrenner
  • Optimized dependency management by using the module cache instead of the vendor directory #645 lokielse

Full Changelog: v0.10.0...v0.10.1

v0.10.0

18 Nov 11:26
ad3ac0e

What's Changed

Added

  • Added parent reference to SubGroup struct in PodGroup CRD to allow a hierarchical SubGroup structure
  • Added time aware scheduling capabilities
  • Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
  • Added the option to configure the names of the webhook configuration resources
  • Added an option to configure reservation pods runtime class
  • Added enforcement of the nvidia runtime class for GPU pods, with the option to enforce a custom runtime class, or disable enforcement entirely
  • Added a preferred podAntiAffinity term by default for all KAI system services; it can be made required instead by setting global.requireDefaultPodAffinityTerm
  • Added support for service-level affinities
  • Added option to specify container name and type for fraction containers

Fixed

  • (OpenShift only) Fixed high CPU usage in the operator pod caused by continuous reconciles
  • Fixed a bug where the scheduler would not re-try updating podgroup status after failure
  • Fixed a bug where ray workloads gang scheduling would ignore minReplicas if autoscaling was not set
  • Fixed wrong status when prometheus operand is enabled in KAI Config
  • Fixed GPU Operator v25.10.0 support for CDI-enabled environments

v0.10.0-rc6

16 Nov 12:12
2297956

Pre-release

What's Changed

  • ci: Extend the amount of ci nodes of the kind cluster used in the "Validate & test" step by @davidLif in #618
  • fix: scheduling shards docs and defaults by @enoodle in #622
  • fix(chart): scope CRD manager permissions to specific resource names by @lokielse in #631
  • fix(chart): Protect resource rendering when resources value is null by @lokielse in #630
  • feat(chart): Add flexible image tag configuration with priority-based overrides by @lokielse in #628
  • refactor: Fix serialization of conf object by @itsomri in #633
  • fix: Prometheus operand by @itsomri in #634
  • refactor(operator): prometheus operand by @enoodle in #629
  • feat(binder): specify CPU and memory requests and limits for GPU reservation pod by @lokielse in #626
  • fix(operator): idempotent sa image pull secrets by @enoodle in #637
  • feat: Time aware configs in scheduling shard by @itsomri in #635
  • feat(podgrouper): Publish the pod-grouper DefaultPluginsHub by @davidLif in #632
  • fix(operator): support latest gpu operator cdi detection by @enoodle in #641
  • ci: add support for custom GOPROXY and GOSUMDB in Docker environment by @lokielse in #643
  • docs: fix quickstart and queue docs by @enoodle in #642
  • docs: add missing default queues example by @enoodle in #648
  • ci: add auto-generated comments to RBAC and CRD YAML files by @lokielse in #644

Full Changelog: v0.10.0-rc5...v0.10.0-rc6

v0.9.8

14 Nov 11:35
96fa84d

What's Changed

  • fix; openshift operator sa reconcile by @enoodle in #646
  • fix: 0.9 - support gpu operator 25.10.0 better by @enoodle in #647

Full Changelog: v0.9.7...v0.9.8