Adding rdmaMode: exclusive to pool config causes a reboot loop #959

@OguzPastirmaci

Description

NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
OS: Linux inst-zamal-oke-rdma 6.8.0-1026-oracle #27-Ubuntu SMP Thu Apr 24 15:44:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

When I enable RDMA exclusive mode in the node pool config, the nodes go into a reboot loop. When I remove it from the manifest, the nodes start functioning normally. RDMA was already set to exclusive mode on this node outside of the operator, if that matters.

Unless I'm misunderstanding the code, it looks like the operator does not check whether the RDMA netns mode returned by DiscoverRDMASubsystem() is already set to exclusive.

pkg/host/internal/network/network.go:442-468

```go
func (n *network) SetRDMASubsystem(mode string) error {
    // No check for the current mode
    path := "/host/etc/modprobe.d/sriov_network_operator_modules_config.conf"

    config := fmt.Sprintf("options ib_core netns_mode=%d\n", modeValue)
    err := os.WriteFile(path, []byte(config), 0644)
    // Always writes, even if unchanged
}
```
```shell
kubectl get sriovnetworknodestate -n nvidia-network-operator -o json | jq -r '.items[] | select(.metadata.name=="10.140.41.76") | .metadata.annotations'
```

```json
{
  "sriovnetwork.openshift.io/current-state": "DrainComplete",
  "sriovnetwork.openshift.io/desired-state": "Reboot_Required"
}
```
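A minimal sketch of the missing guard, for illustration only (the helper names, function signature, and the shared/exclusive-to-`netns_mode` value mapping are my assumptions, not the operator's actual API): comparing the current mode against the desired one before writing the modprobe config would avoid requesting a reboot on every sync.

```go
package main

import (
	"fmt"
	"os"
)

// netnsModeValue maps the mode string to the ib_core netns_mode module
// parameter. Assumed mapping: 1 = shared, 0 = exclusive.
func netnsModeValue(mode string) (int, error) {
	switch mode {
	case "shared":
		return 1, nil
	case "exclusive":
		return 0, nil
	}
	return 0, fmt.Errorf("unknown rdma mode %q", mode)
}

// setRDMASubsystem writes the modprobe config only when the desired mode
// differs from the current one (e.g. as returned by DiscoverRDMASubsystem
// in the operator), and reports whether a reboot is actually needed.
func setRDMASubsystem(current, desired, path string) (rebootNeeded bool, err error) {
	if current == desired {
		// Already in the requested mode: nothing to write, no reboot.
		return false, nil
	}
	v, err := netnsModeValue(desired)
	if err != nil {
		return false, err
	}
	config := fmt.Sprintf("options ib_core netns_mode=%d\n", v)
	if err := os.WriteFile(path, []byte(config), 0644); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	path := "/tmp/sriov_guard_demo_modules_config.conf"

	reboot, err := setRDMASubsystem("shared", "exclusive", path)
	if err != nil {
		panic(err)
	}
	fmt.Println("reboot needed:", reboot) // mode changed

	reboot, _ = setRDMASubsystem("exclusive", "exclusive", path)
	fmt.Println("reboot needed:", reboot) // already exclusive
}
```

With a guard like this, a node that already has exclusive mode set outside the operator would be left alone instead of being drained and rebooted on every reconcile.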
