Add retry logic when updating node labels

After uninstalling a previously installed GPU driver, the k8s-driver-manager attempts to reschedule all of the GPU Operator operands by updating node labels. If this update fails, we do not treat it as a fatal error; we only emit a warning message: https://github.com/NVIDIA/k8s-driver-manager/blob/a837602b5d1a5c3a62ea2166a19d71fea2cf0984/cmd/driver-manager/main.go#L369. As a result, a transient error while updating node labels can leave the GPU Operator state machine deadlocked: the new driver daemonset has come up, but the operands never get rescheduled.

At minimum, we should treat this as a fatal error and have the k8s-driver-manager exit with a non-zero exit code. That way, the k8s-driver-manager will restart and attempt to update the node labels again. We could also consider adding retry logic to the code itself.

Add retry logic when updating node labels #118
