fix: Mellanox reboot loop and improve devlink parameter reconciliation #145
Conversation
rollandf
commented
Jan 22, 2026
- Avoid unnecessary reboots by checking Mellanox firmware multiport state.
- Ensure devlink parameter changes trigger interface reconciliation.
Signed-off-by: Fred Rolland <frolland@nvidia.com>
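To illustrate the second bullet, here is a minimal sketch of how a reconciliation check might compare requested devlink parameters against the reported ones. The `DevlinkParam` type and the `devlinkParamsNeedUpdate` helper are hypothetical names invented for this example; they are not the operator's actual types or its `NeedToUpdateSriov` implementation.

```go
package main

import "fmt"

// DevlinkParam is a hypothetical stand-in for a devlink parameter entry
// (assumption: the real operator types differ).
type DevlinkParam struct {
	Name  string
	Value string
}

// devlinkParamsNeedUpdate reports whether any parameter requested in spec
// is missing from status or has a different value there.
func devlinkParamsNeedUpdate(spec, status []DevlinkParam) bool {
	current := make(map[string]string, len(status))
	for _, p := range status {
		current[p.Name] = p.Value
	}
	for _, p := range spec {
		if v, ok := current[p.Name]; !ok || v != p.Value {
			return true
		}
	}
	return false
}

func main() {
	spec := []DevlinkParam{{Name: "esw_multiport", Value: "true"}}
	status := []DevlinkParam{{Name: "esw_multiport", Value: "false"}}
	// The values differ, so the interface should be reconciled.
	fmt.Println(devlinkParamsNeedUpdate(spec, status))
}
```

Only parameters named in the spec are compared, so extra parameters reported in status do not trigger a reconcile in this sketch.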
e80edf4
into
Mellanox:network-operator-26.1.x
e0ne
left a comment
LGTM. Can be merged once CI passes
Greptile Summary
This PR prevents unnecessary reboots on Mellanox NICs by checking whether the firmware already has the desired multiport state before triggering a reboot. It also ensures that devlink parameter changes trigger interface reconciliation.
Key Changes:
Issues Found:
Confidence Score: 3/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant Plugin as MellanoxPlugin
participant Helper as mellanox.HandleESwitchParams
participant FW as Firmware Query
participant Status as Interface Status
Plugin->>FW: GetMlxNicFwData(pciAddress)
Note over FW: Queries TotalVfs, EnableSriov,<br/>LinkTypeP1, LinkTypeP2,<br/>LagResourceAllocation
FW-->>Plugin: fwCurrent (includes Multiport state)
Plugin->>Helper: HandleESwitchParams(pciPrefix, attrs, fwCurrent, spec, status)
Helper->>Status: isESwitchParamsRequireChange(spec, status)
Note over Status: Checks if esw_multiport<br/>in spec differs from status
Status-->>Helper: needChange, devlinkParam
alt needChange is true
Helper->>Helper: Determine desiredMultiport from devlinkParam
Note over Helper: Bug: defaults to 1 when<br/>devlinkParam is nil
alt fwCurrent.Multiport == -1
Note over Helper: Firmware doesn't support<br/>LagResourceAllocation
Helper-->>Plugin: false (skip firmware change)
else fwCurrent.Multiport == desiredMultiport
Note over Helper: Firmware already has<br/>desired state
Helper-->>Plugin: false (skip reboot)
else fwCurrent.Multiport != desiredMultiport
Note over Helper: Firmware needs update
Helper->>Helper: Set attrs.Multiport = desiredMultiport
Helper-->>Plugin: true (reboot needed)
end
else needChange is false
Helper-->>Plugin: false (no change needed)
end
Plugin->>Plugin: NeedToUpdateSriov checks devlink params
Note over Plugin: New: Now triggers reconciliation<br/>when devlink params change
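The branch structure in the diagram above can be condensed into a small decision function. This is a sketch only: `needsFirmwareReboot` and `multiportUnsupported` are hypothetical names chosen for illustration, and the real `HandleESwitchParams` takes different arguments and also updates `attrs.Multiport`.

```go
package main

import "fmt"

// multiportUnsupported mirrors the -1 value the diagram uses for firmware
// that does not expose LagResourceAllocation (the name is an assumption).
const multiportUnsupported = -1

// needsFirmwareReboot is a hypothetical helper, not the operator's API.
// needChange:  spec and status disagree about esw_multiport.
// fwMultiport: current firmware multiport state (-1, 0, or 1).
// desired:     requested multiport state (0 or 1).
func needsFirmwareReboot(needChange bool, fwMultiport, desired int) bool {
	if !needChange {
		return false // nothing requested
	}
	if fwMultiport == multiportUnsupported {
		return false // firmware cannot be changed, so skip it
	}
	// Reboot only when the firmware differs from the desired state.
	return fwMultiport != desired
}

func main() {
	fmt.Println(needsFirmwareReboot(true, 1, 1))  // firmware already matches
	fmt.Println(needsFirmwareReboot(true, 0, 1))  // firmware needs an update
	fmt.Println(needsFirmwareReboot(true, -1, 1)) // unsupported firmware
	fmt.Println(needsFirmwareReboot(false, 0, 1)) // no change requested
}
```

Only the second call reports that a reboot is needed; the other three correspond to the "skip" branches of the diagram.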
desiredMultiport := 1
if devlinkParam != nil {
	if devlinkParam.Value == devlinkMultiportEnableValue {
		attr.Multiport = 1
		desiredMultiport = 1
	} else {
		attr.Multiport = 0
		desiredMultiport = 0
	}
} else {
	attr.Multiport = 1
}
Logic: when devlinkParam is nil (which happens when esw_multiport is requested in the spec but not found in the status, per lines 494-496), desiredMultiport defaults to 1. This is wrong when the spec requests esw_multiport: false, because the code would try to enable multiport (1) instead of disabling it (0).
desiredMultiport := 1
if devlinkParam != nil {
	if devlinkParam.Value == devlinkMultiportEnableValue {
		desiredMultiport = 1
	} else {
		desiredMultiport = 0
	}
} else {
	// devlinkParam is nil when parameter not found in status (line 494-496)
	// Need to look up the actual requested value from spec
	for _, param := range firstPortSpec.DevlinkParams.Params {
		if param.Name == devlinkMultiportParamName {
			if param.Value == devlinkMultiportEnableValue {
				desiredMultiport = 1
			} else {
				desiredMultiport = 0
			}
			break
		}
	}
}
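The suggested lookup can also be exercised in isolation. Below is a minimal, self-contained sketch under stated assumptions: `DevlinkParam`, `resolveDesiredMultiport`, and the constant values `"esw_multiport"` and `"true"` are placeholders invented for this example and may not match the operator's real definitions. One deliberate difference from the suggestion: instead of silently defaulting to 1 when the parameter is absent from both inputs, the sketch returns an explicit `ok` flag.

```go
package main

import "fmt"

// DevlinkParam is a hypothetical stand-in for the operator's devlink
// parameter type.
type DevlinkParam struct {
	Name  string
	Value string
}

// Assumed values; the real constants live in the operator's code.
const (
	multiportParamName   = "esw_multiport"
	multiportEnableValue = "true"
)

// toMultiport maps a devlink value to the firmware setting (1 or 0).
func toMultiport(value string) int {
	if value == multiportEnableValue {
		return 1
	}
	return 0
}

// resolveDesiredMultiport uses the parameter returned by the change check
// when it is non-nil, and otherwise falls back to the value requested in
// the spec. ok is false when the parameter appears in neither place.
func resolveDesiredMultiport(found *DevlinkParam, specParams []DevlinkParam) (desired int, ok bool) {
	if found != nil {
		return toMultiport(found.Value), true
	}
	for _, p := range specParams {
		if p.Name == multiportParamName {
			return toMultiport(p.Value), true
		}
	}
	return 0, false
}

func main() {
	// The spec asks to disable multiport and the status has no entry yet.
	spec := []DevlinkParam{{Name: multiportParamName, Value: "false"}}
	desired, ok := resolveDesiredMultiport(nil, spec)
	fmt.Println(desired, ok) // prints "0 true"
}
```

With the original default of 1, the same input would have requested multiport to be enabled, which is the case the review comment flags.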