Description
When configuring the GPU Operator to enable RDMA with host MOFED drivers like so:
driver:
  rdma:
    enabled: true
    useHostMofed: true
The k8s-driver-manager init container goes into an endless loop of msg=Waiting for MOFED to be installed... messages.
This seems to be caused by the logic of this function being inverted:
func (dm *DriverManager) waitForMofedDriver() error {
	dm.log.Info("Waiting for MOFED to be installed")

	var isMofedLoaded func() bool
	if dm.config.useHostMofed {
		isMofedLoaded = func() bool {
			_, err := os.Stat("/run/mellanox/drivers/.driver-ready")
			return err == nil
		}
	} else {
		isMofedLoaded = func() bool {
			loadedModules, err := os.ReadFile("/proc/modules")
			if err != nil {
				dm.log.Warnf("Failed to read /proc/modules: %v", err)
				return false
			}
			return strings.Contains(string(loadedModules), "mlx5_core")
		}
	}

	for !isMofedLoaded() {
		dm.log.Info("Waiting for MOFED to be installed...")
		time.Sleep(5 * time.Second)
	}
	return nil
}
When dm.config.useHostMofed evaluates to true, the function waits for /run/mellanox/drivers/.driver-ready to exist, but this file is not present when the host MOFED drivers are used. The /proc/modules file does list the mlx5_core module, but that check is never performed in this branch. It seems the two checks should be swapped, so that /proc/modules is searched when useHostMofed=true.
If I manually create the /run/mellanox/drivers/.driver-ready file on the host, the check succeeds and the process moves forward.