Member

@karthikvetrivel karthikvetrivel commented Oct 2, 2025

This PR is part of the following effort:

GPU Driver container should avoid re-installing drivers on spurious container restarts

Relevant PRs:

Decision Tree:

shouldSkipUninstall()
│
├─> [1] config.forceReinstall == true?
│   ├─> YES → ❌ PROCEED WITH UNINSTALL
│   │           return (false, "")
│   │           Log: "Force reinstall is enabled, proceeding with driver uninstall"
│   │
│   └─> NO → Continue to [2]
│
├─> [2] isDriverLoaded()?
│   │   Check: /sys/module/nvidia/refcnt exists
│   │
│   ├─> NO (modules not loaded) → ❌ PROCEED WITH UNINSTALL (we want to be thorough with cleanup)
│   │                               return (false, "")
│   │                               (No log message)
│   │
│   └─> YES → Continue to [3]
│
├─> [3] config.driverVersion == ""?
│   │
│   ├─> YES (empty) → ❌ PROCEED WITH UNINSTALL
│   │                  return (false, "")
│   │                  (No log message)
│   │
│   └─> NO → Continue to [4]
│
├─> [4] detectCurrentDriverVersion()
│   │   Try method 1: chroot /run/nvidia/driver modinfo -F version nvidia
│   │   Try method 2: cat /sys/module/nvidia/version
│   │
│   ├─> ERROR (can't detect version) → ❌ PROCEED WITH UNINSTALL
│   │                                    return (false, "")
│   │                                    Log: "Unable to determine installed driver version: <err>"
│   │                                    Log: "Cannot verify driver version, proceeding with reinstall..."
│   │
│   └─> SUCCESS (version detected) → Continue to [5]
│       │   Log: "Driver version detected via chroot: X.Y.Z"
│       │     OR "Driver version detected from /sys/module/nvidia/version: X.Y.Z"
│
├─> [5] detected_version == desired_version?
│   │
│   ├─> NO (version mismatch) → ❌ PROCEED WITH UNINSTALL
│   │                            return (false, "")
│   │                            Log: "Installed driver version X does not match desired Y, 
│   │                                  proceeding with uninstall"
│   │
│   └─> YES (versions match) → ✅ SKIP UNINSTALL
│                               return (true, "desired version already present")
│                               Log: "Installed driver version X matches desired version, 
│                                     skipping uninstall"


LEGEND:
═══════
❌ PROCEED WITH UNINSTALL = return (false, "") → uninstallDriver() continues
✅ SKIP UNINSTALL = return (true, "reason") → uninstallDriver() returns nil early


SCENARIOS MAPPED TO TREE:
═════════════════════════

Scenario 1: Standard Clean Restart (no modules)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=FALSE → ❌ PROCEED (full uninstall runs)

Scenario 2: Non-Clean Restart (modules loaded, version matches)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=TRUE → Continue
  [3] driverVersion="580.82.07" → Continue
  [4] detectCurrentDriverVersion()="580.82.07" → Continue
  [5] "580.82.07" == "580.82.07" → ✅ SKIP (early exit, no cleanup)

Scenario 3: Version Mismatch (upgrade needed)
  [1] forceReinstall=false → Continue
  [2] isDriverLoaded()=TRUE → Continue
  [3] driverVersion="580.82.14" → Continue
  [4] detectCurrentDriverVersion()="580.82.07" → Continue
  [5] "580.82.07" != "580.82.14" → ❌ PROCEED (full uninstall runs)


RETURN VALUES:
═════════════

(false, "") → Don't skip, proceed with full uninstall
(true, "desired version already present") → Skip uninstall, exit early
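
For readers who prefer code to the tree, here is a minimal Go sketch of the same five checks. It is an illustration under stated assumptions, not the code in this PR: the `probes` struct and its fields are hypothetical stand-ins for `config.forceReinstall`, `config.driverVersion`, `isDriverLoaded()` and `detectCurrentDriverVersion()`, injected so the logic can run without a GPU node. Log messages are omitted, and trimming whitespace from the detected version is also an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// probes is a hypothetical bundle of the inputs the decision tree consults.
type probes struct {
	forceReinstall bool
	desiredVersion string
	driverLoaded   func() bool            // stand-in for isDriverLoaded()
	detectVersion  func() (string, error) // stand-in for detectCurrentDriverVersion()
}

// shouldSkipUninstall walks checks [1] through [5] in tree order.
// It returns (true, reason) only when every check passes.
func shouldSkipUninstall(p probes) (bool, string) {
	if p.forceReinstall { // [1] forced reinstall always uninstalls
		return false, ""
	}
	if !p.driverLoaded() { // [2] no modules loaded: run the full cleanup
		return false, ""
	}
	if p.desiredVersion == "" { // [3] nothing to compare against
		return false, ""
	}
	current, err := p.detectVersion() // [4] version detection may fail
	if err != nil {
		return false, ""
	}
	// [5] compare versions; TrimSpace is an assumption, since a value read
	// from a sysfs file typically ends with a newline.
	if strings.TrimSpace(current) != p.desiredVersion {
		return false, ""
	}
	return true, "desired version already present"
}

func main() {
	loaded := func() bool { return true }
	notLoaded := func() bool { return false }
	reports := func(v string) func() (string, error) {
		return func() (string, error) { return v, nil }
	}
	fails := func() (string, error) { return "", errors.New("version unavailable") }

	cases := []struct {
		name string
		p    probes
	}{
		{"Scenario 1: no modules", probes{desiredVersion: "580.82.07", driverLoaded: notLoaded, detectVersion: fails}},
		{"Scenario 2: version matches", probes{desiredVersion: "580.82.07", driverLoaded: loaded, detectVersion: reports("580.82.07")}},
		{"Scenario 3: version mismatch", probes{desiredVersion: "580.82.14", driverLoaded: loaded, detectVersion: reports("580.82.07")}},
	}
	for _, c := range cases {
		skip, reason := shouldSkipUninstall(c.p)
		fmt.Printf("%-30s skip=%v reason=%q\n", c.name, skip, reason)
	}
}
```

Only Scenario 2 reaches the final return. Every other path falls through to the full uninstall, which keeps the skip conservative: any doubt about the installed driver results in a reinstall.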


if skip, reason := dm.shouldSkipUninstall(); skip {
	dm.log.Infof("Skipping driver uninstall: %s", reason)
	return nil
Collaborator

Returning early here can be problematic. Even if no nvidia modules are loaded, there are other operations that get executed below, like handling the vfio-pci driver unloading and waiting for MOFED, that are still relevant. We may want to refactor the control flow here.

Member Author

@karthikvetrivel karthikvetrivel Oct 20, 2025

Great point! On second look, I've refactored this so that our stateless operations (vfio-pci, MOFED, nouveau) always run and the stateful ones (uncordon, reschedule) run only when we're tearing down.

EDIT: this actually broke the code. The DaemonSet can return to a running, functional state without this change, so I reverted it.

Collaborator

this actually broke the code.

What was the behavior you observed (if you recall still)?

Member Author

@karthikvetrivel karthikvetrivel Oct 29, 2025

I don't remember exactly, but the DS couldn't reach a running state. I think this makes sense, though: forcing those “stateless” steps (e.g. vfio-pci) even during the skip path left the GPUs unbound from the driver, so the driver pod never recovered.

Collaborator

Ah yes -- performing the unbind is problematic as the driver container (as you have implemented it currently) will not re-load the module when it starts again, doh!
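
To make the two control flows in this thread concrete, the toy Go sketch below lists which steps would run in each. It is not the PR's code: the function name, the flag `statelessOnSkip`, and the step order are assumptions, and the step labels are taken only from the names used in the comments above.

```go
package main

import "fmt"

// uninstallSteps returns the steps that would run for a given path.
// skip mirrors the result of shouldSkipUninstall(); statelessOnSkip models
// the attempted (and later reverted) refactor in which the "stateless"
// steps run even when the uninstall is skipped.
func uninstallSteps(skip, statelessOnSkip bool) []string {
	stateless := []string{"vfio-pci", "MOFED", "nouveau"} // labels from the thread
	stateful := []string{"uncordon", "reschedule"}        // labels from the thread

	var steps []string
	if !skip || statelessOnSkip {
		steps = append(steps, stateless...)
	}
	if !skip {
		steps = append(steps, stateful...)
	}
	return steps
}

func main() {
	fmt.Println("early return on skip:    ", uninstallSteps(true, false))
	fmt.Println("stateless steps on skip: ", uninstallSteps(true, true))
	fmt.Println("full uninstall:          ", uninstallSteps(false, false))
}
```

In the second case the vfio-pci step still runs on the skip path. As discussed above, that left the GPUs unbound while nothing reloaded the module afterwards, which is consistent with the revert to the early return.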

@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 68adf6a to 1991b8c Compare October 16, 2025 15:11
@karthikvetrivel karthikvetrivel marked this pull request as ready for review October 16, 2025 15:19
@karthikvetrivel karthikvetrivel force-pushed the feat/avoid-reinstall-gpu-container branch from 900f54b to 1991b8c Compare October 20, 2025 20:53
@karthikvetrivel
Member Author

@cdesiniotis I've moved shouldSkipUninstall so that the operands still release /run/nvidia/driver mounts.
