Releases: NVIDIA/KAI-Scheduler

v0.5.4

10 Jun 08:26
192f18c

What's Changed

  • Fix scheduler pod group status synchronization between incoming update and in-cluster data - v0.5 cherrypick by @davidLif in #222

Full Changelog: v0.5.3...v0.5.4

v0.4.10

10 Jun 08:28
4c913b1

What's Changed

  • Fix scheduler pod group status synchronization between incoming update and in-cluster data - v0.4 cherrypick by @davidLif in #223

Full Changelog: v0.4.9...v0.4.10

v0.5.3

27 May 16:51
0262a65

What's Changed

  • Pre creating binding request, delete any pending status updates for t… by @davidLif in #185
  • Use more peek and Fix for the implementation of popNextJob instead of… by @davidLif in #190

Full Changelog: v0.5.2...v0.5.3

v0.4.9

27 May 08:15
7cd0616

What's Changed

  • cherry-pick to v0.4: Changed to regular ubuntu docker image in tests (#123) by @enoodle in #188
  • cherrypick v0.4 - Pre creating binding request, delete any pending status updates for t… by @davidLif in #186
  • Don't add nodepool label for empty nodepool by @itsomri in #183

Full Changelog: v0.4.8...v0.4.9

v0.5.2

26 May 14:35
ef61dcd

What's Changed

Full Changelog: v0.5.1...v0.5.2

v0.5.1

19 May 22:50
ca8aaaa

What's Changed

  • Scheduling gates by @itsomri in #122
  • [Bugfix] Reclaim victim queue order by @itsomri in #62
  • make nodeSelector, affinity and tolerations configurable by @gflatters in #127
  • Replace run.ai string with kai.scheduler by @omer-dayan in #131
  • Made scheduling queue label key configurable by @romanbaron in #129
  • Cache GetDeservedShare and GetFairShare by @davidLif in #139
  • refactor(binder): add cache to the resource reservation client by @enoodle in #144
  • Remove requirement for Worker when using PyTorchJob by @Phlip79 in #149
  • PodGroup info caching for some of the results by @davidLif in #138
  • refactor(scheduler): patch pod labels concurrently by @enoodle in #147
  • [Design] Minimum runtime before preemptions and reclaims by @ArmedGuy in #126
  • scheduler: bugfix: Make pod_scenario_builder build scenarios for rest of elastic job by @ArmedGuy in #132

Full Changelog: v0.5.0...v0.5.1

v0.4.8

16 May 09:29

Fixed

  • Queue order function now takes into account potential victims, resulting in better reclaim scenarios.

Changed

  • Cached the GetDeservedShare and GetFairShare functions in the scheduler PodGroupInfo to improve performance.
  • Added a cache to the binder resource reservation client.
  • More caching and improvements to the PodGroupInfo class.
  • Pod labels are now updated concurrently in the background after a scheduling decision.

v0.5.0

08 May 08:02
4607d46

What's Changed

Full Changelog: v0.4.7...v0.5.0

v0.4.7

29 Apr 17:16
73e280a

What's Changed

  • fix: snapshot tool cache.Run call by @enoodle in #102
  • Docs hotfix: Update and rename pytorch-elasitc.yaml to pytorch-elastic.yaml by @EkinKarabulut in #105
  • Adding Issue Templates for bug & feature/enhancement requests by @EkinKarabulut in #103
  • fix: gpu resource device count calculation by @enoodle in #107

Full Changelog: v0.4.6...v0.4.7

v0.4.6

24 Apr 09:03
ead9f27

What's Changed

Full Changelog: v0.4.5...v0.4.6