Add shouldSkipUninstall to avoid GPU driver teardown on restart #103

karthikvetrivel · 2025-10-02T15:48:15Z

This PR is a part of this endeavor:

GPU Driver container should avoid re-installing drivers on spurious container restarts

Relevant PRs:

Check if NVIDIA kernel modules are loaded to avoid modprobe gpu-operator#1746
Add fast-track to skip uninstall/install if NVIDIA driver modules present gpu-driver-container#454

Decision Tree:

shouldSkipUninstall()
│
├─> [1] config.forceReinstall == true?
│   ├─> YES → ❌ PROCEED WITH UNINSTALL
│   │           return (false, "")
│   │           Log: "Force reinstall is enabled, proceeding with driver uninstall"
│   │
│   └─> NO → Continue to [2]
│
├─> [2] isDriverLoaded()?
│   │   Check: /sys/module/nvidia/refcnt exists
│   │
│   ├─> NO (modules not loaded) → ❌ PROCEED WITH UNINSTALL (we want to be thorough with cleanup)
│   │                               return (false, "")
│   │                               (No log message)
│   │
│   └─> YES → Continue to [3]
│
├─> [3] config.driverVersion == ""?
│   │
│   ├─> YES (empty) → ❌ PROCEED WITH UNINSTALL
│   │                  return (false, "")
│   │                  (No log message)
│   │
│   └─> NO → Continue to [4]
│
├─> [4] detectCurrentDriverVersion()
│   │   Try method 1: chroot /run/nvidia/driver modinfo -F version nvidia
│   │   Try method 2: cat /sys/module/nvidia/version
│   │
│   ├─> ERROR (can't detect version) → ❌ PROCEED WITH UNINSTALL
│   │                                    return (false, "")
│   │                                    Log: "Unable to determine installed driver version: <err>"
│   │                                    Log: "Cannot verify driver version, proceeding with reinstall..."
│   │
│   └─> SUCCESS (version detected) → Continue to [5]
│       │   Log: "Driver version detected via chroot: X.Y.Z"
│       │     OR "Driver version detected from /sys/module/nvidia/version: X.Y.Z"
│
├─> [5] detected_version == desired_version?
│   │
│   ├─> NO (version mismatch) → ❌ PROCEED WITH UNINSTALL
│   │                            return (false, "")
│   │                            Log: "Installed driver version X does not match desired Y, 
│   │                                  proceeding with uninstall"
│   │
│   └─> YES (versions match) → ✅ SKIP UNINSTALL
│                               return (true, "desired version already present")
│                               Log: "Installed driver version X matches desired version, 
│                                     skipping uninstall"


LEGEND:
═══════
❌ PROCEED WITH UNINSTALL = return (false, "") → uninstallDriver() continues
✅ SKIP UNINSTALL = return (true, "reason") → uninstallDriver() returns nil early


SCENARIOS MAPPED TO TREE:
═════════════════════════

Scenario 1: Standard Clean Restart (no modules)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=FALSE → ❌ PROCEED (full uninstall runs)

Scenario 2: Non-Clean Restart (modules loaded, version matches)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=TRUE → Continue
  [3] driverVersion="580.82.07" → Continue
  [4] detectCurrentDriverVersion()="580.82.07" → Continue
  [5] "580.82.07" == "580.82.07" → ✅ SKIP (early exit, no cleanup)

Scenario 3: Version Mismatch (upgrade needed)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=TRUE → Continue
  [3] driverVersion="580.82.14" → Continue
  [4] detectCurrentDriverVersion()="580.82.07" → Continue
  [5] "580.82.07" != "580.82.14" → ❌ PROCEED (full uninstall runs)


RETURN VALUES:
═════════════

(false, "") → Don't skip, proceed with full uninstall
(true, "desired version already present") → Skip uninstall, exit early
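The decision tree above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from the PR: the `driverManager` struct, the `detector` type, and the injected `isDriverLoaded`, `detectors`, and `logf` fields are hypothetical seams added so the logic runs without a GPU node. The real checks (`/sys/module/nvidia/refcnt`, `chroot /run/nvidia/driver modinfo -F version nvidia`, `/sys/module/nvidia/version`) would sit behind those fields. The log strings are copied from the tree.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// detector is a hypothetical hook for one version-detection method,
// e.g. running modinfo in the driver root or reading a sysfs file.
type detector func() (string, error)

// driverManager is a simplified stand-in for the PR's manager type.
type driverManager struct {
	forceReinstall bool
	driverVersion  string
	isDriverLoaded func() bool // real check: /sys/module/nvidia/refcnt exists
	detectors      []detector  // tried in order; first success wins
	logf           func(format string, args ...interface{})
}

// detectCurrentDriverVersion tries each method in turn and returns the
// first non-empty version, or the last error seen if none succeeds.
func (dm *driverManager) detectCurrentDriverVersion() (string, error) {
	lastErr := errors.New("no version detector succeeded")
	for _, detect := range dm.detectors {
		version, err := detect()
		if err != nil {
			lastErr = err
			continue
		}
		if version = strings.TrimSpace(version); version != "" {
			return version, nil
		}
	}
	return "", lastErr
}

// shouldSkipUninstall follows steps [1] to [5] of the decision tree.
func (dm *driverManager) shouldSkipUninstall() (bool, string) {
	if dm.forceReinstall { // [1]
		dm.logf("Force reinstall is enabled, proceeding with driver uninstall")
		return false, ""
	}
	if !dm.isDriverLoaded() { // [2] no modules loaded: run the full cleanup
		return false, ""
	}
	if dm.driverVersion == "" { // [3] nothing to compare against
		return false, ""
	}
	installed, err := dm.detectCurrentDriverVersion() // [4]
	if err != nil {
		dm.logf("Unable to determine installed driver version: %v", err)
		dm.logf("Cannot verify driver version, proceeding with reinstall...")
		return false, ""
	}
	if installed != dm.driverVersion { // [5] mismatch, e.g. an upgrade
		dm.logf("Installed driver version %s does not match desired %s, proceeding with uninstall",
			installed, dm.driverVersion)
		return false, ""
	}
	dm.logf("Installed driver version %s matches desired version, skipping uninstall", installed)
	return true, "desired version already present"
}

func main() {
	// Scenario 2: modules loaded, first detection method fails, second matches.
	dm := &driverManager{
		driverVersion:  "580.82.07",
		isDriverLoaded: func() bool { return true },
		detectors: []detector{
			func() (string, error) { return "", errors.New("modinfo unavailable") },
			func() (string, error) { return "580.82.07\n", nil },
		},
		logf: func(format string, args ...interface{}) { fmt.Printf(format+"\n", args...) },
	}
	skip, reason := dm.shouldSkipUninstall()
	fmt.Println(skip, reason)
}
```

Injecting the probes as function fields keeps the five checks testable in isolation; each of the three scenarios above then maps to one combination of field values.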

cmd/driver-manager/main.go

cdesiniotis · 2025-10-07T17:16:02Z


+	if skip, reason := dm.shouldSkipUninstall(); skip {
+		dm.log.Infof("Skipping driver uninstall: %s", reason)
+		return nil


Returning early here can be problematic. Even if no nvidia modules are loaded, there are other operations that get executed below, like handling the vfio-pci driver unloading and waiting for MOFED, that are still relevant. We may want to refactor the control flow here.

karthikvetrivel · Oct 20, 2025 (edited)

Great point! On second look, I've refactored this so that our stateless operations (vfio-pci, MOFED, nouveau) always run and the stateful ones (uncordon, reschedule) only run when we're tearing down.

EDIT: this actually broke the code. The DaemonSet is able to return to a running, functional state without this change, so I reverted it.

cdesiniotis · Oct 29, 2025

> this actually broke the code.

What was the behavior you observed (if you recall still)?

karthikvetrivel · Oct 29, 2025 (edited)

I don't remember exactly, but the DS couldn't reach a running state. I think this makes sense, though: forcing those "stateless" steps (e.g. vfio-pci) even on the skip path left the GPUs unbound from the driver, so the driver pod never recovered.

cdesiniotis · Oct 29, 2025

Ah yes -- performing the unbind is problematic as the driver container (as you have implemented it currently) will not re-load the module when it starts again, doh!

karthikvetrivel · 2025-10-30T00:20:07Z

@cdesiniotis I've moved shouldSkipUninstall so that the operands still release /run/nvidia/driver mounts.
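The reordering described here can be pictured with a small sketch. It is illustrative only: the `manager` type, `evictOperands`, and `teardownDriver` are hypothetical names, and the real position of the check in cmd/driver-manager/main.go is not shown in this thread. The assumption, taken from the commit title, is that operand eviction runs first so the operands release their /run/nvidia/driver mounts, and only then does the skip check decide whether the teardown proceeds.

```go
package main

import "fmt"

// manager is a hypothetical stand-in that records which steps ran.
type manager struct {
	versionMatches bool // pretend outcome of the shouldSkipUninstall checks
	steps          []string
}

// evictOperands stands in for removing the pods that hold
// /run/nvidia/driver mounts; here it only records that it ran.
func (m *manager) evictOperands() error {
	m.steps = append(m.steps, "evict operands")
	return nil
}

// shouldSkipUninstall is reduced to a single flag for this sketch.
func (m *manager) shouldSkipUninstall() (bool, string) {
	if m.versionMatches {
		return true, "desired version already present"
	}
	return false, ""
}

// teardownDriver stands in for the remaining uninstall work.
func (m *manager) teardownDriver() error {
	m.steps = append(m.steps, "tear down driver")
	return nil
}

// uninstallDriver evicts operands unconditionally, then consults the
// skip check before doing any driver teardown.
func (m *manager) uninstallDriver() error {
	if err := m.evictOperands(); err != nil {
		return fmt.Errorf("evicting operands: %w", err)
	}
	if skip, reason := m.shouldSkipUninstall(); skip {
		m.steps = append(m.steps, "skip uninstall: "+reason)
		return nil
	}
	return m.teardownDriver()
}

func main() {
	for _, matches := range []bool{true, false} {
		m := &manager{versionMatches: matches}
		if err := m.uninstallDriver(); err != nil {
			fmt.Println("error:", err)
		}
		fmt.Println(m.steps)
	}
}
```

In both paths the eviction step runs; only the teardown is conditional, which is the property the commit title ("after operand eviction to ensure mount refresh") points at.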

karthikvetrivel marked this pull request as draft October 2, 2025 15:48

This was referenced Oct 2, 2025:

Add fast-track to skip uninstall/install if NVIDIA driver modules present NVIDIA/gpu-driver-container#454 (Open)

Check if NVIDIA kernel modules are loaded to avoid modprobe NVIDIA/gpu-operator#1746 (Draft)

cdesiniotis reviewed Oct 7, 2025


Add shouldSkipUninstall to avoid GPU driver teardown on restart

1991b8c

Signed-off-by: Karthik Vetrivel <[email protected]>

karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 68adf6a to 1991b8c Compare October 16, 2025 15:11

karthikvetrivel marked this pull request as ready for review October 16, 2025 15:19

karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 900f54b to 1991b8c Compare October 20, 2025 20:53

refactor: move shouldSkipUninstall check after operand eviction to ensure mount refresh

37ae9a0

Signed-off-by: Karthik Vetrivel <[email protected]>
