fix: reboot loop and improve devlink parameter reconciliation #134
e0ne merged 1 commit into Mellanox:network-operator-26.1.x from
rollandf commented on Jan 18, 2026
- Avoid unnecessary reboots by checking Mellanox firmware multiport state.
- Ensure devlink parameter changes trigger interface reconciliation.
Signed-off-by: Fred Rolland <frolland@nvidia.com>
Greptile Summary
This PR fixes the Mellanox firmware reboot loop and ensures that devlink parameter changes trigger interface reconciliation.
Key Changes
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Plugin as MellanoxPlugin
participant Helper as api/v1/helper
participant Vendor as mellanox/mlxutils
participant FW as Firmware (mstconfig)
Note over Plugin: OnNodeStateChange()
Plugin->>Vendor: GetMlxNicFwData(pciAddress)
Vendor->>FW: MstConfigReadData()
FW-->>Vendor: Current & Next FW data (including LagResourceAllocation)
Vendor->>Vendor: mlnxNicFromMap()
Note over Vendor: Parse LagResourceAllocation<br/>Set Multiport: -1 if unsupported<br/>Set Multiport: 0 or 1 if supported
Vendor-->>Plugin: fwCurrent, fwNext
Plugin->>Vendor: HandleESwitchParams(pciPrefix, attrs, fwCurrent, specs, status)
Vendor->>Vendor: isESwitchParamsRequireChange(spec, status)
Note over Vendor: Check if esw_multiport<br/>param requested
alt Multiport change needed
alt fwCurrent.Multiport == -1
Note over Vendor: LagResourceAllocation not supported<br/>Skip firmware change
Vendor-->>Plugin: needReboot = false
else fwCurrent.Multiport == desiredMultiport
Note over Vendor: Already set in firmware<br/>Skip reboot
Vendor-->>Plugin: needReboot = false
else Multiport needs change
Note over Vendor: Set attrs.Multiport to desired value
Vendor-->>Plugin: needReboot = true
end
else No change needed
Vendor-->>Plugin: needReboot = false
end
Plugin->>Helper: NeedToUpdateSriov(ifaceSpec, ifaceStatus)
Helper->>Helper: NeedToUpdateDevlinkParams(desired, current)
Note over Helper: Compare each devlink param<br/>Check name and value match
Helper-->>Plugin: true if devlink params differ
alt needReboot == true
Plugin->>Vendor: MlxConfigFW(attributesToChange)
Vendor->>FW: mstconfig set LAG_RESOURCE_ALLOCATION=X
Plugin->>Vendor: MlxResetFW(pciAddresses)
Vendor->>FW: mstfwreset --skip_driver
Note over Plugin: Node will reboot
end
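The reboot decision in the diagram above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this PR: `multiportNeedsReboot`, its parameters, and the `multiportUnsupported` constant are hypothetical names, and the only facts taken from the diagram are the sentinel value `-1` for firmware without LagResourceAllocation support and the three outcomes of the `alt` branches.

```go
package main

import "fmt"

// multiportUnsupported mirrors the -1 sentinel from the diagram: the firmware
// does not expose LagResourceAllocation. The constant name is an assumption.
const multiportUnsupported = -1

// multiportNeedsReboot is a hypothetical helper, not the operator's API.
// requested: whether the policy asks for the esw_multiport devlink param.
// fwCurrent: current firmware Multiport value (-1, 0 or 1).
// desired:   Multiport value implied by the policy (0 or 1).
func multiportNeedsReboot(requested bool, fwCurrent, desired int) bool {
	switch {
	case !requested:
		// No esw_multiport change requested: nothing to do.
		return false
	case fwCurrent == multiportUnsupported:
		// LagResourceAllocation not supported: skip the firmware change.
		return false
	case fwCurrent == desired:
		// Already set in firmware: rebooting again would cause the loop.
		return false
	default:
		// Firmware value differs from the desired one: configure and reboot.
		return true
	}
}

func main() {
	fmt.Println(multiportNeedsReboot(true, multiportUnsupported, 1)) // false
	fmt.Println(multiportNeedsReboot(true, 1, 1))                    // false
	fmt.Println(multiportNeedsReboot(true, 0, 1))                    // true
}
```

The second case is the one that matters for the reboot loop: once the firmware already holds the desired value, the function reports that no reboot is needed.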
Policy used:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
name: rail-0
namespace: network-operator
spec:
bridge:
groupingPolicy: all
ovs:
bridge:
datapathType: netdev
failMode: secure
uplink:
interface:
mtuRequest: 1500
type: dpdk
deviceType: netdevice
devlinkParams:
params:
- applyOn: PF
cmode: runtime
name: esw_multiport
value: "true"
eSwitchMode: switchdev
isRdma: true
linkType: ETH
mtu: 1500
nicSelector:
pfNames:
- eth_p0_r0
nodeSelector:
node-role.kubernetes.io/worker: ""
numVfs: 1
priority: 99
  resourceName: rail_0
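
For a policy like this one, the name-and-value comparison that the review summary attributes to `NeedToUpdateDevlinkParams` might look like the sketch below. The `DevlinkParam` struct and the lowercase function are simplified, hypothetical stand-ins; the real types and signature in `api/v1/helper` are not shown in this conversation and may differ.

```go
package main

import "fmt"

// DevlinkParam is a hypothetical, simplified shape holding only the fields
// needed for the comparison; the operator's real type may carry more.
type DevlinkParam struct {
	Name  string
	Value string
}

// needToUpdateDevlinkParams reports whether any desired parameter is missing
// from the current state or present with a different value.
func needToUpdateDevlinkParams(desired, current []DevlinkParam) bool {
	currentByName := make(map[string]string, len(current))
	for _, p := range current {
		currentByName[p.Name] = p.Value
	}
	for _, want := range desired {
		got, ok := currentByName[want.Name]
		if !ok || got != want.Value {
			return true
		}
	}
	return false
}

func main() {
	desired := []DevlinkParam{{Name: "esw_multiport", Value: "true"}}
	current := []DevlinkParam{{Name: "esw_multiport", Value: "false"}}
	// The values differ, so the interface would need reconciliation.
	fmt.Println(needToUpdateDevlinkParams(desired, current))
}
```

Only the desired parameters are checked here, so extra parameters reported in the current state do not force an update; whether the PR makes the same choice is an assumption.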
Failing log:
Merged 94dfa2a into Mellanox:network-operator-26.1.x