Bug Fixes
- fix: avoid duplicate addition of desiredTaint
- fix: allow error messages to be reported in ComposableResource
- feat: update support for RKE2 and K8S
- fix: modify GetResources() in CM/FM
- fix: make ready-to-detach ComposableResource independent of the ComposabilityRequest controller
- fix: let ComposableResource detach skip some cases where no pod is found
- fix: correct FM API error handling and update test set
- feat: support creating and deleting DeviceTaintRule resources
- fix: add FM resource existence check before sending delete request and update test set
- fix: reduce RestartDaemonset wait time to 10s
- chore: log token refresh events
- fix: change parsing method for res_op_status in cm/client.go
- fix: resolve bug in DrainGPU function and add completion log to RunNvidiaSmi
- fix: correct bug where CDIDeviceID was passed as data to FM API
- fix: correct missing information in ComposableResource created by Upstream Syncer
- fix: extend FM API timeout from 1 minute to 3 minutes
- fix: assign DeletionTimestamp instead of force-detaching resources that are ready for detach; add garbage collection for ComposabilityRequest and ComposableResource and update related test sets
- fix: remove SetNodeSchedulable and update test sets
- refactor: create DeviceTaintRule using driver/pool/device instead of CEL
- refactor: detach GPUs via the cro-node-agent Pod instead of the nvidia-dra-driver-gpu-kubelet-plugin Pod; improve fault tolerance when GPU detachment fails; update test sets accordingly
- fix: stop treating processes that used a /dev/nvidiaX file as errors when they have already been killed
- fix: move reading of the FTI_CDI-specific environment variable into a switch/case statement
- fix: call the CheckGPUVisible function from the handleDetachingState function
- fix: remove unnecessary sleep and node label checks; add /proc scan in DrainGPU and CheckNoGPULoads
Full Changelog: v0.1.0...v0.1.1