Open
Description
NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
OS: Linux inst-zamal-oke-rdma 6.8.0-1026-oracle #27-Ubuntu SMP Thu Apr 24 15:44:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
When I enable RDMA exclusive mode in the node pool config, the nodes go into a reboot loop. When I remove that setting from the manifest, the nodes start functioning normally. In case it matters, RDMA was already set to exclusive on this node outside of the operator.
Unless I'm misunderstanding the code, it looks like the operator never checks whether the RDMA netns mode returned by DiscoverRDMASubsystem() is already exclusive before applying the change.
pkg/host/internal/network/network.go:442-468
func (n *network) SetRDMASubsystem(mode string) error {
	// No check for current mode
	path := "/host/etc/modprobe.d/sriov_network_operator_modules_config.conf"
	config := fmt.Sprintf("options ib_core netns_mode=%d\n", modeValue)
	err := os.WriteFile(path, []byte(config), 0644)
	// Always writes, even if unchanged
}

kubectl get sriovnetworknodestate -n nvidia-network-operator -o json | jq -r '.items[] | select(.metadata.name=="10.140.41.76") | .metadata.annotations'
{
"sriovnetwork.openshift.io/current-state": "DrainComplete",
"sriovnetwork.openshift.io/desired-state": "Reboot_Required"
}
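To illustrate the kind of check I would expect, here is a minimal standalone sketch, not the operator's actual code. The helper names (renderModprobeConfig, needsReboot, ensureRDMASubsystem) are hypothetical, and I am assuming that netns_mode=0 means exclusive and netns_mode=1 means shared for ib_core. It skips the write when the file already holds the desired content and only flags a reboot when the live mode differs from the requested one.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// renderModprobeConfig returns the modprobe.d line for the requested mode.
// Assumption: ib_core netns_mode=0 selects exclusive mode, 1 selects shared.
func renderModprobeConfig(mode string) (string, error) {
	switch mode {
	case "exclusive":
		return "options ib_core netns_mode=0\n", nil
	case "shared":
		return "options ib_core netns_mode=1\n", nil
	default:
		return "", fmt.Errorf("unsupported RDMA subsystem mode %q", mode)
	}
}

// needsReboot is a hypothetical helper: a reboot is only worth requesting
// when the live mode differs from the desired one.
func needsReboot(currentMode, desiredMode string) bool {
	return currentMode != desiredMode
}

// ensureRDMASubsystem writes the config file only if its content would
// change, and reports whether a write actually happened.
func ensureRDMASubsystem(path, desiredMode string) (bool, error) {
	config, err := renderModprobeConfig(desiredMode)
	if err != nil {
		return false, err
	}
	existing, err := os.ReadFile(path)
	if err == nil && bytes.Equal(existing, []byte(config)) {
		return false, nil // already up to date, nothing to write
	}
	if err != nil && !errors.Is(err, os.ErrNotExist) {
		return false, err // unexpected read failure
	}
	if err := os.WriteFile(path, []byte(config), 0o644); err != nil {
		return false, err
	}
	return true, nil
}

func main() {
	// Use a temporary directory instead of the real /host/etc/modprobe.d.
	dir, err := os.MkdirTemp("", "rdma-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "sriov_network_operator_modules_config.conf")

	// Apply the same desired mode twice; the second pass should not write.
	for i := 0; i < 2; i++ {
		wrote, err := ensureRDMASubsystem(path, "exclusive")
		if err != nil {
			panic(err)
		}
		fmt.Printf("pass %d: wrote=%v rebootNeeded=%v\n",
			i+1, wrote, needsReboot("exclusive", "exclusive"))
	}
}
```

With a guard like this, a node that is already in exclusive mode would see at most one write of the config file and no reboot request, which is the behaviour I expected from the operator. Whether the reboot decision is really tied to this write is my guess from reading the code.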