Releases: NVIDIA/KAI-Scheduler
v0.12.0
Note: v0.12 is a stable release as described in SUPPORT.md
What's Changed
Added
- Introduced native KAI Topology CRD to replace dependency on Kueue's Topology CRD, improving compatibility and simplifying installation
- Added support for having the default "preemptibility" per top-owner-type read from the default configs configmap in the pod-grouper
- Added option to profile CPU when running the snapshot tool #726 itsomri
- GPU resource bookkeeping for DRA enabled resources
- Added a "tumbling window" usage configuration - calculates a tumbling window size based on a start time configuration and a duration config field.
- Added an option to disable prometheus persistency #764 itsomri
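To illustrate the "tumbling window" entry above: a tumbling window is a fixed-length, non-overlapping interval anchored at a start time. The sketch below shows one common way to compute the window that contains a given instant. It is not taken from the KAI Scheduler code base; the function name `tumbling_window` and its parameters are hypothetical, and the real config field names and edge-case handling may differ.

```python
from datetime import datetime, timedelta, timezone


def tumbling_window(start: datetime, duration: timedelta, now: datetime):
    """Return (window_start, window_end) for the tumbling window containing `now`.

    Hypothetical helper for illustration only; windows are half-open
    intervals [window_start, window_end) laid end to end from `start`.
    """
    if duration <= timedelta(0):
        raise ValueError("duration must be positive")
    if now < start:
        raise ValueError("now precedes the configured start time")
    # Number of whole windows that have fully elapsed since `start`.
    elapsed_windows = (now - start) // duration
    window_start = start + elapsed_windows * duration
    return window_start, window_start + duration


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    now = datetime(2025, 1, 17, 12, tzinfo=timezone.utc)
    print(tumbling_window(start, timedelta(days=7), now))
```

With a start of 2025-01-01 and a 7-day duration, an instant on 2025-01-17 falls in the third window, which runs from 2025-01-15 to 2025-01-22.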
Changed
- If enabled, prometheus storage size is not inferred from cluster objects, but defaults to 50Gi unless explicitly set in KAI config #756 itsomri
- When prometheus is disabled, it will remain in the cluster for a grace period equal to its retention, unless re-enabled #756 itsomri
Fixed
- Fixed a bug where the snapshot tool would not load topology objects #720 itsomri
- The operator now conditionally watches ClusterPolicy based on its existence, preventing errors in its absence
- Fixed confusing resource division log message #733 itsomri
- Made post-delete-cleanup resources configurable #737 dttung2905
- Fixed GPU memory pods not being reclaimed or consolidated correctly
- Added missing leases permission for the operator #753 dttung2905
- Fixed reclaim/preempt/consolidate actions for topology workloads #739 itsomri
- Fixed a bug where the scheduler would not consider topology constraints when calculating the scheduling constraints signature #761 gshaibi
- Fixed Dynamo integration by adding Dynamo GVKs to SkipTopOwner table
- Keep creating service monitors for deprecated prometheus instances #774 itsomri
- Fix retention duration parsing for deprecated prometheus instances #774 itsomri
Changed
- Renamed the previous "tumbling" option for the scheduler usage window type to "cron".
New Contributors
- @dttung2905 made their first contribution in #737
Full Changelog: v0.10.2...v0.12.0
v0.10.5
v0.10.4
What's Changed
Added
- Added a "tumbling window" usage configuration - calculates a tumbling window size based on a start time configuration and a duration config field.
- Added an option to disable prometheus persistency #765 itsomri
Changed
- If enabled, prometheus storage size is not inferred from cluster objects, but defaults to 50Gi unless explicitly set in KAI config #765 itsomri
- When prometheus is disabled, it will remain in the cluster for a grace period equal to its retention, unless re-enabled #765 itsomri
Fixed
Full Changelog: v0.10.3...v0.10.4
v0.10.3
What's Changed
- chore(ci): move Docker data and image cache to /mnt for more disk space by @gshaibi in #718
- feat(ci): changelog validation by @gshaibi in #722
- test: check flakey test by @github-actions[bot] in #730
- Resource division log usage v0.10 by @itsomri in #734
- feat: default preemtibility from configmap by @natasharomm in #736
Full Changelog: v0.10.2...v0.10.3
v0.9.9
v0.10.2
What's Changed
Fixed
- Removed the requirement to specify container type for init container gpu fractions #684 itsomri
- When a status update for a podGroup in the scheduler is flushed due to update conflict, delete the update payload data as well #691 davidLif
- Fixed scheduling shard cleanup on helm uninstall #678 srujanreddya
New Contributors
- @srujanreddya made their first contribution in #678
Full Changelog: v0.10.1...v0.10.2
v0.10.1
What's Changed
Fixed
- Fixed scheduler pod group status update conflict #676 davidLif
- Fixed gpu request validations for pods #660 itsomri
Changed
- Dependabot configuration to update actions in workflows #651 ScottBrenner
- Optimized dependency management by using the module cache instead of the vendor directory #645 lokielse
New Contributors
- @ScottBrenner made their first contribution in #651
Full Changelog: v0.10.0...v0.10.1
v0.10.0
What's Changed
Added
- Added parent reference to SubGroup struct in PodGroup CRD to allow a hierarchical SubGroup structure
- Added time aware scheduling capabilities
- Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
- Added the option to configure the names of the webhook configuration resources
- Added an option to configure reservation pods runtime class
- Added enforcement of the nvidia runtime class for GPU pods, with the option to enforce a custom runtime class, or disable enforcement entirely
- Added a preferred podAntiAffinity term by default for all KAI system services, can be set to required instead by setting global.requireDefaultPodAffinityTerm
- Added support for service-level affinities
- Added option to specify container name and type for fraction containers
Fixed
- (OpenShift only) - High CPU usage for the operator pod due to continuous reconciles
- Fixed a bug where the scheduler would not re-try updating podgroup status after failure
- Fixed a bug where ray workloads gang scheduling would ignore minReplicas if autoscaling was not set
- Fixed wrong status when prometheus operand is enabled in KAI Config
- GPU-Operator v25.10.0 support for CDI enabled environments
v0.10.0-rc6
What's Changed
- ci: Extend the amount of ci nodes of the kind cluster used in the "Validate & test" step by @davidLif in #618
- fix: scheduling shards docs and defaults by @enoodle in #622
- fix(chart): scope CRD manager permissions to specific resource names by @lokielse in #631
- fix(chart): Protect resource rendering when resources value is null by @lokielse in #630
- feat(chart): Add flexible image tag configuration with priority-based overrides by @lokielse in #628
- refactor: Fix serialization of conf object by @itsomri in #633
- fix: Prometheus operand by @itsomri in #634
- refactor(operator): prometheus operand by @enoodle in #629
- feat(binder): specify CPU and memory requests and limits for GPU reservation pod by @lokielse in #626
- fix(operator): idempotent sa image pull secrets by @enoodle in #637
- feat: Time aware configs in scheduling shard by @itsomri in #635
- feat(podgrouper): Publish the pod-grouper DefaultPluginsHub by @davidLif in #632
- fix(operator): support latest gpu operator cdi detection by @enoodle in #641
- ci: add support for custom GOPROXY and GOSUMDB in Docker environment by @lokielse in #643
- docs: fix quickstart and queue docs by @enoodle in #642
- docs: add missing default queues example by @enoodle in #648
- ci: add auto-generated comments to RBAC and CRD YAML files by @lokielse in #644
New Contributors
Full Changelog: v0.10.0-rc5...v0.10.0-rc6