[warm-reboot] Fix probe failure with non-SONiC peers by mattkim25 · Pull Request #4304 · sonic-net/sonic-utilities

mattkim25 · 2026-02-25T08:41:32Z

What I did

Fixes issue sonic-net/sonic-buildimage#21996

Fix issue where warm-reboot fails when port channels are connected to non-SONiC peer devices (e.g., Arista cEOS, vEOS).

The teamd_increase_retry_count.py script was incorrectly marking port channels as failed when detecting non-SONiC peers. Since the teamd retry count feature is SONiC-specific, the script should skip the probe for non-SONiC devices rather than failing.

How I did it

- Remove failedPortChannels.append() for non-SONiC peers
- Improve warning message to include peer device name and context
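
As a rough illustration of the behavior change, here is a minimal sketch of the probe loop. It is not the actual code in teamd_increase_retry_count.py: the function names, the dictionary keys, and the `supports_retry_count` stub are hypothetical. Only the `"sonic" not in descr.lower()` check, the warning text, and the exit codes 0 and 2 come from this PR.

```python
def probe_port_channels(peers, log_warning=print):
    """Return the port channels that failed the probe.

    `peers` maps a port channel name to a dict with hypothetical keys:
    "name" (peer device name), "descr" (peer system description) and
    "supports_retry_count" (stub for the result of the real probe).
    """
    failed = []
    for port_channel, peer in peers.items():
        if "sonic" not in peer["descr"].lower():
            # Non-SONiC peer: warn and skip instead of marking it failed.
            log_warning(
                f"WARNING: Peer device {peer['name']} is not a SONiC device; "
                "skipping teamd retry count probe"
            )
            continue
        if not peer["supports_retry_count"]:
            failed.append(port_channel)
    return failed


def exit_code(failed):
    # Exit code 2 when any port channel failed the probe, 0 otherwise.
    return 2 if failed else 0
```

With this shape, a topology whose peers are all non-SONiC yields an empty failure list and exit code 0, while a SONiC peer that lacks the feature still fails the probe.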

How to verify it

Run the teamd probe on a T0 topology with non-SONiC T1 peers (e.g., Arista cEOS).

Expected behavior: Script logs warnings for non-SONiC peers and exits with code 0. Previously, it would append to failedPortChannels and exit with code 2, blocking warm-reboot.

Manual probe test:

admin@sonic:~$ sudo /usr/local/bin/teamd_increase_retry_count.py --probe-only
admin@sonic:~$ echo $?
0
admin@sonic:~$ sudo journalctl --since "1 minute ago" | grep -i "Peer device"
Feb 25 08:28:33 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA01T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:28:33 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA02T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:28:33 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA03T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:28:34 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA04T1 is not a SONiC device; skipping teamd retry count probe

Warm-reboot test (capture logs):

admin@sonic:~$ sudo journalctl -f -t teamd_increase_retry_count.py > /home/admin/teamd_probe_logs.txt &
admin@sonic:~$ sudo warm-reboot
...
admin@sonic:~$ cat teamd_probe_logs.txt
Feb 25 08:31:38 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA01T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:31:39 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA02T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:31:39 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA03T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:31:39 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA04T1 is not a SONiC device; skipping teamd retry count probe

Previous command output (if the output of a command-line utility has changed)

admin@sonic:~$ sudo warm-reboot
ERROR: There are port channels/peer devices that failed the probe: ['PortChannel101', 'PortChannel102', 'PortChannel103', 'PortChannel104']

mssonicbld · 2026-02-25T08:41:41Z

/azp run

azure-pipelines · 2026-02-25T08:41:51Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-02-25T08:46:12Z

/azp run

azure-pipelines · 2026-02-25T08:46:22Z

Azure Pipelines successfully started running 1 pipeline(s).

Fix issue where warm-reboot fails when port channels are connected to non-SONiC peer devices (e.g., Arista cEOS, vEOS) in test topologies (e.g., sonic-mgmt).

The teamd retry count feature is SONiC-specific. This change allows warm-reboot to proceed in mixed-vendor topologies by skipping the probe for non-SONiC devices instead of failing.

- Remove failedPortChannels.append() for non-SONiC peers
- Improve warning message to include peer device name and context
- Add comment explaining why non-SONiC peers are skipped

Signed-off-by: Matthew Kim <mkim@upscaleai.com>

mssonicbld · 2026-02-25T08:48:27Z

/azp run

azure-pipelines · 2026-02-25T08:48:38Z

Azure Pipelines successfully started running 1 pipeline(s).

saiarcot895 · 2026-02-25T17:09:09Z

scripts/teamd_increase_retry_count.py

+                # Don't fail the port channel for non-SONiC peers, just skip the probe
                if "sonic" not in peerInfo["descr"].lower():
-                    log.log_warning("WARNING: Peer device is not a SONiC device; skipping")
-                    failedPortChannels.append(portChannel)


This is intentional: the presence of possible non-SONiC peers is meant to be a failing condition. If the device is about to undergo a warm reboot and assumes that all peer devices are SONiC, it might incorrectly conclude that there are no non-SONiC peers and that all SONiC peers are running a sufficiently new version. A warm reboot could then be allowed, and LAGs with non-SONiC peers would go down.

mattkim25 · Feb 25, 2026

In my sonic-mgmt testbed, all peers are Arista VMs, so this check correctly identifies them as non-SONiC peers and fails. However, this means warm reboot will never be allowed in environments where all peers are non-SONiC, which is the case in my testbed. Is there a recommended way to handle this scenario? Or should this be gated behind a configuration option to allow warm reboot in mixed/non-SONiC environments?

saiarcot895 · Feb 27, 2026 (edited)

Unless you pass in the -N flag, warm reboot should still be allowed regardless of this check; it should print "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available", but not block anything. If you pass in -N, then it will block warm reboot.
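
The gating described in this comment can be sketched roughly as follows. This is an illustration only, not the warm-reboot script itself: `decide_warm_reboot` and `require_retry_count` (standing in for the flag that makes the probe mandatory) are hypothetical names, and the blocking message is a placeholder. The warning text is the one quoted in the comment.

```python
WARNING_MESSAGE = (
    "Warning: Retry count feature support unknown for one or more "
    "neighbor devices; assuming that it's not available"
)


def decide_warm_reboot(probe_rc, require_retry_count):
    """Return (proceed, message) for a given probe exit code."""
    if probe_rc == 0:
        # Probe succeeded: nothing to report.
        return True, None
    if require_retry_count:
        # Placeholder text, not the script's real error message.
        return False, "probe failed and retry count support is required"
    # Probe failed but is not mandatory: warn and carry on.
    return True, WARNING_MESSAGE
```

Under this reading, a failed probe only blocks warm reboot when the operator has explicitly asked for retry count support to be required.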

mattkim25 · Feb 28, 2026

When running warm reboot, the message is logged as an error, not a warning:

ERROR: There are port channels/peer devices that failed the probe: ['PortChannel101', 'PortChannel103', 'PortChannel104', 'PortChannel102']

This does not prevent warm reboot from completing, but the error is caught by the sonic-mgmt warm reboot tests and causes them to fail. Perhaps I should instead file a bug on the sonic-mgmt repo to ignore this message when non-SONiC peer devices are used in a T0 topology? I am curious how others are ignoring this error message when running warm reboot tests.
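
For illustration, one way a test harness could tolerate this message is to filter it out before checking the logs for errors. The sketch below is an assumption about how such filtering might look, not sonic-mgmt's actual mechanism; `PROBE_ERROR` and `filter_expected_errors` are hypothetical names.

```python
import re

# Matches the probe error quoted above, with any list of port channels.
PROBE_ERROR = re.compile(
    r"ERROR: There are port channels/peer devices that failed the probe: \[.*\]"
)


def filter_expected_errors(lines, ignore_patterns=(PROBE_ERROR,)):
    """Drop log lines that match any of the expected-error patterns."""
    return [
        line
        for line in lines
        if not any(pattern.search(line) for pattern in ignore_patterns)
    ]
```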

mattkim25 force-pushed the mkim/fix_portchannel_error-teamd_increase_retry_count branch from a06a18e to c86119b on February 25, 2026 08:46

mattkim25 force-pushed the mkim/fix_portchannel_error-teamd_increase_retry_count branch from c86119b to 8f5705e on February 25, 2026 08:48

mattkim25 mentioned this pull request Feb 25, 2026

[warm-reboot] ERROR: There are port channels/peer devices that failed the probe: ['PortChannel101', 'PortChannel102', 'PortChannel103', 'PortChannel104'] sonic-net/sonic-buildimage#21996

Open

prsunny requested review from saiarcot895 and volodymyrsamotiy February 25, 2026 17:00

saiarcot895 reviewed Feb 25, 2026

mattkim25 wants to merge 1 commit into sonic-net:master from mattkim25:mkim/fix_portchannel_error-teamd_increase_retry_count
