
Add retry logic when updating node labels #118

@cdesiniotis

Description

After uninstalling a previously installed GPU driver, the k8s-driver-manager attempts to reschedule all of the GPU Operator operands by updating node labels. If this operation fails, we do not treat this as a fatal error and instead just emit a warning message:

dm.log.Warnf("Failed to reschedule GPU operator components: %v", err)
Because of this, the GPU Operator state machine can enter a deadlock: if a transient error occurs while updating node labels, the new driver daemonset comes up but the operands never get rescheduled.

At minimum, we should treat this as a fatal error and have the k8s-driver-manager exit with a non-zero exit code. This way, the k8s-driver-manager will restart and attempt to update the node labels again. We could also consider adding retry logic to the code itself.
