v0.1.1

Latest

@shogom2 released this 05 Feb 08:15
· 26 commits to main since this release
858f6f0

Bug Fixes

  • fix: avoid duplicate addition of desiredTaint
  • fix: allow error messages to be reported in ComposableResource
  • feat: update support for RKE2 and K8S
  • fix: modify GetResources() in CM/FM
  • fix: make ready-to-detach composableResource independent of ComposabilityRequest controller
  • fix: let ComposableResource detach skip some cases where no pod is found
  • fix: correct FM API error handling and update test set
  • feat: support creating and deleting DeviceTaintRule resources
  • fix: add FM resource existence check before sending delete request and update test set
  • fix: reduce RestartDaemonset wait time to 10s
  • chore: log token refresh events
  • fix: change parsing method for res_op_status in cm/client.go
  • fix: resolve bug in DrainGPU function and add completion log to RunNvidiaSmi
  • fix: correct bug where CDIDeviceID was passed as data to FM API
  • fix: correct missing information in ComposableResource created by Upstream Syncer
  • fix: extend FM API timeout from 1 minute to 3 minutes
  • fix: assign DeletionTimestamp instead of force-detaching resources that are ready for detach; add garbage collection for ComposabilityRequest and ComposableResource and update related test sets
  • fix: remove SetNodeSchedulable and update test sets
  • refactor: create DeviceTaintRule using driver/pool/device instead of CEL
  • refactor: detach GPUs via the cro-node-agent Pod instead of the nvidia-dra-driver-gpu-kubelet-plugin Pod; improve fault tolerance when GPU detachment fails; update test sets accordingly
  • fix: stop treating processes that were using the /dev/nvidiaX file as errors when they have already been killed
  • fix: move the logic that reads the FTI_CDI-specific environment variable into a switch/case statement
  • fix: call the CheckGPUVisible function from the handleDetachingState function
  • fix: remove unnecessary sleep and node label checks; add /proc scan in DrainGPU and CheckNoGPULoads

What's Changed

  • Fix multiple issues related to API responses and GPU attach/detach handling by @NekoHK in #14

Full Changelog: v0.1.0...v0.1.1