MOFED detection on host broken in v0.9.0 #123

@kreeuwijk

Description

When configuring the GPU Operator to enable RDMA with host MOFED drivers like so:

driver:
  rdma:
    enabled: true
    useHostMofed: true

The k8s-driver-manager init container goes into an endless loop, repeatedly logging msg="Waiting for MOFED to be installed...".

This seems to be caused by the logic of this function being inverted:

func (dm *DriverManager) waitForMofedDriver() error {
	dm.log.Info("Waiting for MOFED to be installed")

	var isMofedLoaded func() bool
	if dm.config.useHostMofed {
		isMofedLoaded = func() bool {
			_, err := os.Stat("/run/mellanox/drivers/.driver-ready")
			return err == nil
		}
	} else {
		isMofedLoaded = func() bool {
			loadedModules, err := os.ReadFile("/proc/modules")
			if err != nil {
				dm.log.Warnf("Failed to read /proc/modules: %v", err)
				return false
			}
			return strings.Contains(string(loadedModules), "mlx5_core")
		}
	}

	for !isMofedLoaded() {
		dm.log.Info("Waiting for MOFED to be installed...")
		time.Sleep(5 * time.Second)
	}

	return nil
}

When dm.config.useHostMofed evaluates to true, the function waits for /run/mellanox/drivers/.driver-ready to exist, but this file is not present when using host MOFED drivers. The /proc/modules file does contain the mlx5_core module, but that check is never performed. It seems the two checks should be swapped, so that /proc/modules is searched when useHostMofed=true.

If I manually create the /run/mellanox/drivers/.driver-ready file on the host, the check succeeds and the process moves forward.
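To make the suggested swap concrete, here is a minimal standalone sketch of how the readiness check might look with the two branches exchanged. It is not the project's actual fix: mofedCheck and moduleLoaded are hypothetical helper names that do not exist in k8s-driver-manager, and the paths are passed in as parameters only so the logic can be exercised outside a real node. The paths and the mlx5_core substring match are taken from the function quoted above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Paths as they appear in the function quoted in this issue.
const (
	driverReadyFile = "/run/mellanox/drivers/.driver-ready"
	procModules     = "/proc/modules"
)

// moduleLoaded reports whether name occurs in a /proc/modules-style listing.
// It uses the same substring match as the quoted function.
// (Hypothetical helper, not part of k8s-driver-manager.)
func moduleLoaded(modules, name string) bool {
	return strings.Contains(modules, name)
}

// mofedCheck returns the readiness check with the branches swapped as
// proposed in this issue: host MOFED looks for the kernel module, and
// the other case looks for the .driver-ready marker file.
// (Hypothetical helper, not part of k8s-driver-manager.)
func mofedCheck(useHostMofed bool, modulesPath, readyPath string) func() bool {
	if useHostMofed {
		return func() bool {
			data, err := os.ReadFile(modulesPath)
			if err != nil {
				// An unreadable module list is treated as "not loaded yet".
				return false
			}
			return moduleLoaded(string(data), "mlx5_core")
		}
	}
	return func() bool {
		_, err := os.Stat(readyPath)
		return err == nil
	}
}

func main() {
	// With useHostMofed=true the check no longer depends on .driver-ready.
	check := mofedCheck(true, procModules, driverReadyFile)
	fmt.Println("host MOFED detected:", check())
}
```

In waitForMofedDriver, the equivalent change would be to exchange the bodies of the two isMofedLoaded branches and leave the polling loop as it is.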
