
[Smartswitch] Stabilize the DPU kernel panic and memory exhaustion tests #22252

Merged
roy-sror merged 1 commit into sonic-net:master from congh-nvidia:reload_dpu on Mar 1, 2026


@congh-nvidia
Contributor

Description of PR

Summary:
After a kernel panic or memory exhaustion is triggered, we should first check that the DPUs have gone offline, and only then check that they are back online.
Otherwise the online check can pass before the DPUs have actually rebooted, and the later critical services check will fail.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Stabilize the DPU kernel panic and memory exhaustion tests

How did you do it?

Add a check that the DPUs are offline before the post_test_dpus_check in two test cases of test_reload_dpu.py:
test_dpu_status_post_dpu_kernel_panic
test_dpu_check_post_dpu_mem_exhaustion
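
For illustration only, here is a minimal, self-contained sketch of the ordering issue. It is not the code from this PR: `wait_for`, `FakeDpu`, `check_online_only` and `check_offline_then_online` are hypothetical names, and the fake DPU simply replays a scripted sequence of online/offline states instead of talking to real hardware.

```python
import time


def wait_for(condition, timeout, interval=0.0):
    """Poll condition() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


class FakeDpu:
    """Hypothetical stand-in for a DPU (not a sonic-mgmt class).

    Each call to is_online() returns the next scripted state; the last
    state repeats once the script is exhausted.
    """

    def __init__(self, states):
        self._states = list(states)
        self.polls = 0

    def is_online(self):
        idx = min(self.polls, len(self._states) - 1)
        self.polls += 1
        return self._states[idx]


def check_online_only(dpu, timeout=1.0):
    # Racy: can succeed while the DPU is still up from before the fault.
    return wait_for(dpu.is_online, timeout)


def check_offline_then_online(dpu, timeout=1.0):
    # First confirm the DPU actually went down, then wait for it to return.
    if not wait_for(lambda: not dpu.is_online(), timeout):
        return False
    return wait_for(dpu.is_online, timeout)


# Scripted timeline: still online after the fault is triggered,
# then offline while rebooting, then online again.
TIMELINE = [True, True, False, False, True]

stale = FakeDpu(TIMELINE)
stale_result = check_online_only(stale)      # succeeds on the first poll

fixed = FakeDpu(TIMELINE)
fixed_result = check_offline_then_online(fixed)  # consumes the whole timeline
```

In this sketch the online-only check returns after a single poll, before the scripted reboot has happened, while the offline-then-online check only succeeds once the fake DPU has gone down and come back.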

How did you verify/test it?

Ran the tests on an SN4280 testbed.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

We should check the DPUs are offline before checking they are online
after the kernel panic or memory exhaustion.
Otherwise the check for DPU online could pass even before the DPUs are
rebooted and the later critical services check will fail.

Signed-off-by: Cong Hou <congh@nvidia.com>
@congh-nvidia
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

nhe-NV added the "Request for 202511 branch" label (Request to backport a change to 202511 branch) on Feb 7, 2026
roy-sror merged commit f0b15ad into sonic-net:master on Mar 1, 2026
16 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 2, 2026
…#22252)

We should check the DPUs are offline before checking they are online
after the kernel panic or memory exhaustion.
Otherwise the check for DPU online could pass even before the DPUs are
rebooted and the later critical services check will fail.

Signed-off-by: mssonicbld <sonicbld@microsoft.com>
@mssonicbld
Collaborator

Cherry-pick PR to 202511: #22691

aronovic pushed a commit to aronovic/sonic-mgmt that referenced this pull request Mar 3, 2026
…#22252)

We should check the DPUs are offline before checking they are online
after the kernel panic or memory exhaustion.
Otherwise the check for DPU online could pass even before the DPUs are
rebooted and the later critical services check will fail.

Signed-off-by: Mihut Aronovici <aronovic@cisco.com>
rraghav-cisco pushed a commit to rraghav-cisco/sonic-mgmt that referenced this pull request Mar 3, 2026
…#22252)

We should check the DPUs are offline before checking they are online
after the kernel panic or memory exhaustion.
Otherwise the check for DPU online could pass even before the DPUs are
rebooted and the later critical services check will fail.

Signed-off-by: Raghavendran Ramanathan <rraghav@cisco.com>