Skip to content

fix: Mellanox reboot loop and improve devlink parameter reconciliation#145

Merged
almaslennikov merged 1 commit intoMellanox:network-operator-26.1.xfrom
rollandf:loop-restore
Jan 22, 2026
Merged

fix: Mellanox reboot loop and improve devlink parameter reconciliation#145
almaslennikov merged 1 commit intoMellanox:network-operator-26.1.xfrom
rollandf:loop-restore

Conversation

@rollandf
Copy link
Member

  • Avoid unnecessary reboots by checking Mellanox firmware multiport state.
  • Ensure devlink parameter changes trigger interface reconciliation.

- Avoid unnecessary reboots by checking Mellanox firmware multiport state.
- Ensure devlink parameter changes trigger interface reconciliation.

Signed-off-by: Fred Rolland <frolland@nvidia.com>
@github-actions
Copy link

Thanks for your PR,
To run vendors CIs, Maintainers can use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs, Maintainers can use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor.
    Best regards.

@github-actions github-actions bot added the tests label Jan 22, 2026
@almaslennikov almaslennikov merged commit e80edf4 into Mellanox:network-operator-26.1.x Jan 22, 2026
10 of 12 checks passed
Copy link
Collaborator

@e0ne e0ne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Can be merged once CI passes

@greptile-apps
Copy link

greptile-apps bot commented Jan 22, 2026

Greptile Summary

This PR prevents unnecessary reboots on Mellanox NICs by checking if the firmware already has the desired multiport state before triggering a reboot. It also ensures that devlink parameter changes trigger interface reconciliation by integrating NeedToUpdateDevlinkParams into the NeedToUpdateSriov function.

Key Changes:

  • Added NeedToUpdateDevlinkParams function to detect when devlink parameters need updates
  • Modified HandleESwitchParams to check current firmware state (LagResourceAllocation) before deciding to reboot
  • Added logic to skip firmware changes when LagResourceAllocation is not supported
  • Added logic to skip reboots when firmware already has the desired multiport setting
  • Updated tests to pass fwCurrent parameter to HandleESwitchParams

Issues Found:

  • Logic bug in HandleESwitchParams at lines 411-418: when devlinkParam is nil (parameter not found in status), the code defaults desiredMultiport to 1, which causes incorrect behavior if the spec requests esw_multiport: false

Confidence Score: 3/5

  • This PR is mostly safe but contains a logic bug that could cause incorrect multiport configuration in specific scenarios
  • The PR successfully achieves its goal of preventing unnecessary reboots and improves devlink parameter reconciliation. However, there's a logic bug in HandleESwitchParams where desiredMultiport incorrectly defaults to 1 when devlinkParam is nil, causing wrong behavior when spec requests esw_multiport: false and it's not in status. The tests don't cover this edge case. The rest of the changes are well-structured and correct.
  • Pay close attention to pkg/vendors/mellanox/mellanox.go lines 411-418 which contain the logic bug

Important Files Changed

Filename Overview
api/v1/helper.go Adds NeedToUpdateDevlinkParams function to check if devlink parameters need updates, and integrates it into NeedToUpdateSriov. Logic is sound and handles missing parameters correctly.
pkg/vendors/mellanox/mellanox.go Improves multiport reboot logic by checking firmware state before rebooting. Contains a logic bug in HandleESwitchParams when devlinkParam is nil and spec requests false.
pkg/plugins/mellanox/mellanox_plugin.go Passes fwCurrent to HandleESwitchParams to enable firmware state checking. Simple and correct change.
pkg/vendors/mellanox/mellanox_test.go Updates all test calls to HandleESwitchParams with new fwCurrent parameter and adds new test cases. Missing test case for esw_multiport: false when not in status.

Sequence Diagram

sequenceDiagram
    participant Plugin as MellanoxPlugin
    participant Helper as mellanox.HandleESwitchParams
    participant FW as Firmware Query
    participant Status as Interface Status
    
    Plugin->>FW: GetMlxNicFwData(pciAddress)
    Note over FW: Queries TotalVfs, EnableSriov,<br/>LinkTypeP1, LinkTypeP2,<br/>LagResourceAllocation
    FW-->>Plugin: fwCurrent (includes Multiport state)
    
    Plugin->>Helper: HandleESwitchParams(pciPrefix, attrs, fwCurrent, spec, status)
    Helper->>Status: isESwitchParamsRequireChange(spec, status)
    Note over Status: Checks if esw_multiport<br/>in spec differs from status
    Status-->>Helper: needChange, devlinkParam
    
    alt needChange is true
        Helper->>Helper: Determine desiredMultiport from devlinkParam
        Note over Helper: Bug: defaults to 1 when<br/>devlinkParam is nil
        
        alt fwCurrent.Multiport == -1
            Note over Helper: Firmware doesn't support<br/>LagResourceAllocation
            Helper-->>Plugin: false (skip firmware change)
        else fwCurrent.Multiport == desiredMultiport
            Note over Helper: Firmware already has<br/>desired state
            Helper-->>Plugin: false (skip reboot)
        else fwCurrent.Multiport != desiredMultiport
            Note over Helper: Firmware needs update
            Helper->>Helper: Set attrs.Multiport = desiredMultiport
            Helper-->>Plugin: true (reboot needed)
        end
    else needChange is false
        Helper-->>Plugin: false (no change needed)
    end
    
    Plugin->>Plugin: NeedToUpdateSriov checks devlink params
    Note over Plugin: New: Now triggers reconciliation<br/>when devlink params change
Loading

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 files reviewed, 1 comment

Edit Code Review Agent Settings | Greptile

Comment on lines +411 to 418
desiredMultiport := 1
if devlinkParam != nil {
if devlinkParam.Value == devlinkMultiportEnableValue {
attr.Multiport = 1
desiredMultiport = 1
} else {
attr.Multiport = 0
desiredMultiport = 0
}
} else {
attr.Multiport = 1
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logic: when devlinkParam is nil (which happens when esw_multiport is requested in spec but not found in status, per line 494-496), desiredMultiport defaults to 1. This causes incorrect behavior when the spec requests esw_multiport: false - it would try to enable (1) instead of disable (0).

Suggested change
desiredMultiport := 1
if devlinkParam != nil {
if devlinkParam.Value == devlinkMultiportEnableValue {
attr.Multiport = 1
desiredMultiport = 1
} else {
attr.Multiport = 0
desiredMultiport = 0
}
} else {
attr.Multiport = 1
}
desiredMultiport := 1
if devlinkParam != nil {
if devlinkParam.Value == devlinkMultiportEnableValue {
desiredMultiport = 1
} else {
desiredMultiport = 0
}
} else {
// devlinkParam is nil when parameter not found in status (line 494-496)
// Need to look up the actual requested value from spec
for _, param := range firstPortSpec.DevlinkParams.Params {
if param.Name == devlinkMultiportParamName {
if param.Value == devlinkMultiportEnableValue {
desiredMultiport = 1
} else {
desiredMultiport = 0
}
break
}
}
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants