Add shouldSkipUninstall to avoid GPU driver teardown on restart #103
cmd/driver-manager/main.go
Outdated
if skip, reason := dm.shouldSkipUninstall(); skip {
	dm.log.Infof("Skipping driver uninstall: %s", reason)
	return nil
Returning early here can be problematic. Even if no nvidia modules are loaded, there are other operations that get executed below, like handling the vfio-pci driver unloading and waiting for MOFED, that are still relevant. We may want to refactor the control flow here.
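To make the concern concrete, here is a minimal Go sketch of how an early return short-circuits everything that follows it. The function name `uninstallSteps`, the step labels, and their ordering are hypothetical, chosen only to mirror the operations named in this comment; this is not the actual body of the uninstall path.

```go
package main

import "fmt"

// uninstallSteps is a hypothetical stand-in for the uninstall path.
// It returns the names of the steps that would have executed.
// The step names and their order are assumptions for illustration.
func uninstallSteps(skip bool) []string {
	var ran []string
	if skip {
		// Early return, as in the hunk under review:
		// none of the steps below are reached.
		return ran
	}
	ran = append(ran, "unload nvidia modules")
	ran = append(ran, "handle vfio-pci driver unloading")
	ran = append(ran, "wait for MOFED")
	return ran
}

func main() {
	// Compare the steps reached on the skip path and on the normal path.
	fmt.Println("skip=true: ", uninstallSteps(true))
	fmt.Println("skip=false:", uninstallSteps(false))
}
```

On the skip path the returned list is empty, which is the behavior being questioned: the vfio-pci and MOFED handling are dropped along with the module unload.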
Great point! On second look, I've refactored this so that our stateless operations (vfio-pci, MOFED, nouveau) always run and the stateful ones (uncordon, reschedule) only run when we're tearing down.
EDIT: this actually broke the code. The DaemonSet is able to return to a running, functional state without this change, so I reverted it.
> this actually broke the code.
What was the behavior you observed (if you recall still)?
I don't remember exactly, but the DS couldn't reach a running state. I think this makes sense, though: forcing those “stateless” steps (e.g. vfio-pci) even during the skip path left the GPUs unbound from the driver, so the driver pod never recovered.
Ah yes -- performing the unbind is problematic as the driver container (as you have implemented it currently) will not re-load the module when it starts again, doh!
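The failure mode described in the last two comments can be modeled with a toy Go sketch. Everything here is hypothetical: the `gpu` type, the two functions, and the `alwaysRunStateless` flag are illustrative names, and the assumption that the driver container does nothing on the skip path is taken from the comment above rather than from the actual code.

```go
package main

import "fmt"

// gpu is a toy model of a single device; bound reports whether it is
// currently bound to the driver.
type gpu struct{ bound bool }

// driverManagerPass models the driver manager. With alwaysRunStateless
// set (the reverted variant), the unbind runs even when skip is true.
func driverManagerPass(g *gpu, skip, alwaysRunStateless bool) {
	if skip && !alwaysRunStateless {
		return // skip path: leave the device untouched
	}
	g.bound = false // unbind step
}

// driverContainerStart models the driver container starting afterwards.
// Assumption: on the skip path it expects the driver to still be in
// place and does not re-load the module.
func driverContainerStart(g *gpu, skip bool) {
	if skip {
		return
	}
	g.bound = true // full install: module loaded, device bound again
}

// recovered reports whether the device ends up bound after one restart.
func recovered(skip, alwaysRunStateless bool) bool {
	g := &gpu{bound: true}
	driverManagerPass(g, skip, alwaysRunStateless)
	driverContainerStart(g, skip)
	return g.bound
}

func main() {
	fmt.Println("skip, early return:        ", recovered(true, false))
	fmt.Println("skip, stateless steps run: ", recovered(true, true))
	fmt.Println("no skip, full reinstall:   ", recovered(false, false))
}
```

In this model only the middle case leaves the device unbound: the unbind happens, but nothing on the skip path binds it again, which matches the observation that the driver pod never recovered.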
Signed-off-by: Karthik Vetrivel <[email protected]>
68adf6a to 1991b8c
900f54b to 1991b8c
…sure mount refresh Signed-off-by: Karthik Vetrivel <[email protected]>
@cdesiniotis I've moved
This PR is a part of this endeavor:
Relevant PRs:
Decision Tree: