Open
Description
After uninstalling a previously installed GPU driver, the k8s-driver-manager attempts to reschedule all of the GPU Operator operands by updating node labels. If this operation fails, we do not treat this as a fatal error and instead just emit a warning message:
k8s-driver-manager/cmd/driver-manager/main.go, line 369 in a837602:

    dm.log.Warnf("Failed to reschedule GPU operator components: %v", err)
At minimum, we should treat this as a fatal error and have the k8s-driver-manager exit with a non-zero exit code. This way, the k8s-driver-manager will restart and attempt to update the node labels again. We could also consider adding restart logic to the code itself.