Releases · NVIDIA/KAI-Scheduler

24 Dec 14:41

itsomri

v0.12.0

c2e76fb

v0.12.0

Note: v0.12 is a stable release as described in SUPPORT.md

What's Changed

Added

Introduced native KAI Topology CRD to replace dependency on Kueue's Topology CRD, improving compatibility and simplifying installation
Added support for having the default "preemptibility" per top-owner-type read from the default configs configmap in the pod-grouper
Added option to profile CPU when running the snapshot tool #726 itsomri
Added GPU resource bookkeeping for DRA-enabled resources
Added a "tumbling window" usage configuration: the tumbling window size is calculated from a start time configuration and a duration config field.
Added an option to disable prometheus persistency #764 itsomri
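
The tumbling window entry above can be illustrated with a small sketch. It assumes that windows are consecutive, non-overlapping intervals of length `duration`, the first one beginning at `start_time`. The function name and its parameters are hypothetical and are not taken from the KAI configuration schema.

```python
from datetime import datetime, timedelta, timezone


def current_tumbling_window(start_time, duration, now):
    """Return (window_start, window_end) of the tumbling window containing `now`.

    Assumption: windows are back-to-back intervals of length `duration`,
    the first one beginning at `start_time`.
    """
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    if now < start_time:
        raise ValueError("now precedes the configured start time")
    # Number of whole windows that have fully elapsed since the start time.
    elapsed_windows = (now - start_time) // duration
    window_start = start_time + elapsed_windows * duration
    return window_start, window_start + duration


# Usage: weekly windows anchored at 1 January 2025 (illustrative values).
start = datetime(2025, 1, 1, tzinfo=timezone.utc)
now = datetime(2025, 1, 17, 12, 0, tzinfo=timezone.utc)
window = current_tumbling_window(start, timedelta(days=7), now)
print(window[0].isoformat(), window[1].isoformat())
```

Unlike a sliding window, the window boundaries here depend only on the start time and the duration, so every caller that evaluates the same instant gets the same window.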

Changed

If enabled, prometheus storage size is not inferred from cluster objects, but defaults to 50Gi unless explicitly set in KAI config #756 itsomri
When prometheus is disabled, it will remain in the cluster for a grace period equal to its retention, unless re-enabled #756 itsomri
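
The two behaviours above can be sketched as follows. This is only an illustration of the rules as the notes state them, not the operator's code; the function names, parameters, and the `re_enabled` flag are hypothetical.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_STORAGE_SIZE = "50Gi"  # default stated in the release note


def effective_storage_size(configured_size=None):
    """Use the explicitly configured size if present, else the default."""
    return configured_size if configured_size else DEFAULT_STORAGE_SIZE


def should_remove_prometheus(disabled_at, retention, now, re_enabled=False):
    """Return True once the grace period (assumed equal to the retention) has elapsed."""
    if re_enabled:
        # Re-enabling cancels the pending removal.
        return False
    return now - disabled_at >= retention


# Usage with illustrative values: disabled on 1 December, two-week retention.
disabled_at = datetime(2025, 12, 1, tzinfo=timezone.utc)
retention = timedelta(weeks=2)
print(effective_storage_size())
print(should_remove_prometheus(disabled_at, retention, disabled_at + timedelta(days=3)))
```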

Fixed

Fixed a bug where the snapshot tool would not load topology objects #720 itsomri
The operator now watches ClusterPolicy only when it exists, preventing errors in its absence
Fixed confusing resource division log message #733 itsomri
Made post-delete-cleanup resources configurable #737 dttung2905
Fixed a bug where GPU Memory pods were not reclaimed or consolidated correctly
Added missing leases permission for the operator #753 dttung2905
Fixed reclaim/preempt/consolidate actions for topology workloads #739 itsomri
Fixed a bug where the scheduler would not consider topology constraints when calculating the scheduling constraints signature #761 gshaibi
Fixed Dynamo integration by adding Dynamo GVKs to SkipTopOwner table
Keep creating service monitors for deprecated prometheus instances #774 itsomri
Fix retention duration parsing for deprecated prometheus instances #774 itsomri
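
The notes do not say what the retention parsing bug was. As background, Prometheus-style durations use the units `ms`, `s`, `m`, `h`, `d`, `w`, and `y` (for example `15d` or `1h30m`), which general-purpose duration parsers do not always accept. The following is a simplified sketch of parsing such strings; `parse_retention` is a hypothetical helper, not a KAI or Prometheus function, and it does not enforce the unit ordering that Prometheus requires.

```python
import re
from datetime import timedelta

# Unit sizes in milliseconds; a year is treated as 365 days and a week as 7 days.
_UNIT_MS = {
    "ms": 1,
    "s": 1000,
    "m": 60 * 1000,
    "h": 60 * 60 * 1000,
    "d": 24 * 60 * 60 * 1000,
    "w": 7 * 24 * 60 * 60 * 1000,
    "y": 365 * 24 * 60 * 60 * 1000,
}
# "ms" is listed first so that "5ms" is not read as "5m" followed by a stray "s".
_TOKEN = re.compile(r"(\d+)(ms|[ywdhms])")


def parse_retention(text):
    """Parse a string such as '15d' or '1h30m' into a timedelta."""
    position = 0
    total_ms = 0
    for match in _TOKEN.finditer(text):
        if match.start() != position:
            raise ValueError(f"invalid duration: {text!r}")
        total_ms += int(match.group(1)) * _UNIT_MS[match.group(2)]
        position = match.end()
    # Reject empty input and trailing characters that no token consumed.
    if position == 0 or position != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return timedelta(milliseconds=total_ms)


print(parse_retention("15d"), parse_retention("1h30m"))
```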

Changed

Renamed the previous "tumbling" option for the scheduler usage window type to "cron".

New Contributors

@dttung2905 made their first contribution in #737

Full Changelog: v0.10.2...v0.12.0

Contributors

dttung2905

24 Dec 10:21

SiorMeir

v0.10.5

cf0617b

v0.10.5

What's Changed

Fixed

Keep creating service monitors for deprecated prometheus instances #775 itsomri
Fix retention duration parsing for deprecated prometheus instances #775 itsomri
Fixed Dynamo integration by adding Dynamo GVKs to SkipTopOwner table #757

Full Changelog: v0.10.4...v0.10.5

23 Dec 10:18

itsomri

v0.10.4

461e81b

v0.10.4

What's Changed

Added

Added a "tumbling window" usage configuration: the tumbling window size is calculated from a start time configuration and a duration config field.
Added an option to disable prometheus persistency #765 itsomri

Changed

If enabled, prometheus storage size is not inferred from cluster objects, but defaults to 50Gi unless explicitly set in KAI config #765 itsomri
When prometheus is disabled, it will remain in the cluster for a grace period equal to its retention, unless re-enabled #765 itsomri

Fixed

Fixed reclaim/preempt/consolidate actions for topology workloads #748 itsomri

Full Changelog: v0.10.3...v0.10.4

09 Dec 10:35

natasharomm

v0.10.3

6f8ecfa

v0.10.3

What's Changed

chore(ci): move Docker data and image cache to /mnt for more disk space by @gshaibi in #718
feat(ci): changelog validation by @gshaibi in #722
test: check flakey test by @github-actions[bot] in #730
Resource division log usage v0.10 by @itsomri in #734
feat: default preemtibility from configmap by @natasharomm in #736

Full Changelog: v0.10.2...v0.10.3

Contributors

gshaibi, natasharomm, and itsomri

07 Dec 22:34

gshaibi

v0.9.9

114e59f

v0.9.9

What's Changed

chore(ci): move Docker data and image cache to /mnt for more disk space by @gshaibi in #719
feat(ci): changelog validation by @gshaibi in #723
fix(api,operator): make KaiConfig conditions compatible with helm 4 wait logic by @gshaibi in #710

Full Changelog: v0.9.8...v0.9.9

Contributors

gshaibi

25 Nov 14:41

itsomri

v0.10.2

97cf9c2

v0.10.2

What's Changed

Fixed

Removed the requirement to specify container type for init container gpu fractions #684 itsomri
When a status update for a podGroup in the scheduler is flushed due to update conflict, delete the update payload data as well #691 davidLif
Fixed scheduling shard cleanup on helm uninstall #678 srujanreddya

New Contributors

@srujanreddya made their first contribution in #678

Full Changelog: v0.10.1...v0.10.2

Contributors

srujanreddya

23 Nov 11:05

davidLif

v0.10.1

883b1f9

v0.10.1

What's Changed

Fixed

Fixed scheduler pod group status update conflict #676 davidLif
Fixed gpu request validations for pods #660 itsomri

Changed

Dependabot configuration to update actions in workflows #651 ScottBrenner
Optimized dependency management by using the module cache instead of the vendor directory #645 lokielse

New Contributors

@ScottBrenner made their first contribution in #651

Full Changelog: v0.10.0...v0.10.1

Contributors

ScottBrenner

18 Nov 11:26

itsomri

v0.10.0

ad3ac0e

v0.10.0

What's Changed

Added

Added parent reference to SubGroup struct in PodGroup CRD to allow a hierarchical SubGroup structure
Added time aware scheduling capabilities
Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
Added the option to configure the names of the webhook configuration resources
Added an option to configure reservation pods runtime class
Added enforcement of the nvidia runtime class for GPU pods, with the option to enforce a custom runtime class, or disable enforcement entirely
Added a preferred podAntiAffinity term by default for all KAI system services, can be set to required instead by setting global.requireDefaultPodAffinityTerm
Added support for service-level affinities
Added option to specify container name and type for fraction containers
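
The parent reference on SubGroup makes it possible to describe a tree of SubGroups. The sketch below shows one way such references could be resolved and validated; the `SubGroup` class here is a simplified stand-in, and the field name `parent` and the helper `subgroup_paths` are assumptions for illustration, not the PodGroup CRD schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SubGroup:
    name: str
    parent: Optional[str] = None  # hypothetical field: name of the parent SubGroup


def subgroup_paths(subgroups):
    """Map each SubGroup name to its root-to-node path, rejecting bad references."""
    by_name = {sg.name: sg for sg in subgroups}
    if len(by_name) != len(subgroups):
        raise ValueError("duplicate SubGroup name")
    paths = {}
    for sg in subgroups:
        chain, seen, current = [], set(), sg
        while current is not None:
            if current.name in seen:
                raise ValueError(f"cycle detected at {current.name!r}")
            seen.add(current.name)
            chain.append(current.name)
            if current.parent is None:
                current = None  # reached a root
            elif current.parent not in by_name:
                raise ValueError(f"unknown parent {current.parent!r}")
            else:
                current = by_name[current.parent]
        paths[sg.name] = list(reversed(chain))
    return paths


# Usage with illustrative names: two leaf SubGroups under one parent.
groups = [
    SubGroup("workers"),
    SubGroup("rack-a", parent="workers"),
    SubGroup("rack-b", parent="workers"),
]
paths = subgroup_paths(groups)
print(paths["rack-a"])
```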

Fixed

(Openshift only) - High CPU usage for the operator pod due to continuous reconciles
Fixed a bug where the scheduler would not re-try updating podgroup status after failure
Fixed a bug where ray workloads gang scheduling would ignore minReplicas if autoscaling was not set
Fixed wrong status when prometheus operand is enabled in KAI Config
Added GPU-Operator v25.10.0 support for CDI-enabled environments

16 Nov 12:12

davidLif

v0.10.0-rc6

2297956

v0.10.0-rc6 Pre-release

Pre-release

What's Changed

ci: Extend the amount of ci nodes of the kind cluster used in the "Validate & test" step by @davidLif in #618
fix: scheduling shards docs and defaults by @enoodle in #622
fix(chart): scope CRD manager permissions to specific resource names by @lokielse in #631
fix(chart): Protect resource rendering when resources value is null by @lokielse in #630
feat(chart): Add flexible image tag configuration with priority-based overrides by @lokielse in #628
refactor: Fix serialization of conf object by @itsomri in #633
fix: Prometheus operand by @itsomri in #634
refactor(operator): prometheus operand by @enoodle in #629
feat(binder): specify CPU and memory requests and limits for GPU reservation pod by @lokielse in #626
fix(operator): idempotent sa image pull secrets by @enoodle in #637
feat: Time aware configs in scheduling shard by @itsomri in #635
feat(podgrouper): Publish the pod-grouper DefaultPluginsHub by @davidLif in #632
fix(operator): support latest gpu operator cdi detection by @enoodle in #641
ci: add support for custom GOPROXY and GOSUMDB in Docker environment by @lokielse in #643
docs: fix quickstart and queue docs by @enoodle in #642
docs: add missing default queues example by @enoodle in #648
ci: add auto-generated comments to RBAC and CRD YAML files by @lokielse in #644

New Contributors

@lokielse made their first contribution in #631

Full Changelog: v0.10.0-rc5...v0.10.0-rc6

Contributors

lokielse, enoodle, and 2 other contributors

14 Nov 11:35

enoodle

v0.9.8

96fa84d

v0.9.8

What's Changed

fix; openshift operator sa reconcile by @enoodle in #646
fix: 0.9 - support gpu operator 25.10.0 better by @enoodle in #647

Full Changelog: v0.9.7...v0.9.8

Contributors

enoodle
