[warm-reboot] Fix probe failure with non-SONiC peers#4304

Open
mattkim25 wants to merge 1 commit into sonic-net:master from mattkim25:mkim/fix_portchannel_error-teamd_increase_retry_count

mattkim25 commented Feb 25, 2026

What I did

Fixes issue #21996

Fix issue where warm-reboot fails when port channels are connected to non-SONiC peer devices (e.g., Arista cEOS, vEOS).

The teamd_increase_retry_count.py script was incorrectly marking port channels as failed when detecting non-SONiC peers. Since the teamd retry count feature is SONiC-specific, the script should skip the probe for non-SONiC devices rather than failing.

How I did it

  • Remove failedPortChannels.append() for non-SONiC peers
  • Improve warning message to include peer device name and context
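The change listed above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the function names `classify_peers` and `probe_exit_code`, the `fail_on_non_sonic` switch, the shape of the `peers` dictionary, and the sample `descr` strings are all hypothetical. Only the `"sonic" not in descr.lower()` check, the warning wording, and the exit codes 0 and 2 are taken from this PR.

```python
def classify_peers(peers, fail_on_non_sonic=False):
    """Sort port channels with non-SONiC peers into failed or skipped.

    `peers` maps a port channel name to a dict with the hypothetical keys
    "name" (peer device name) and "descr" (peer system description).
    """
    failed_port_channels = []
    skipped_port_channels = []
    warnings = []
    for port_channel, peer_info in peers.items():
        if "sonic" not in peer_info["descr"].lower():
            warnings.append(
                "WARNING: Peer device {} is not a SONiC device; "
                "skipping teamd retry count probe".format(peer_info["name"])
            )
            if fail_on_non_sonic:
                # Behaviour before this PR: a non-SONiC peer fails the probe.
                failed_port_channels.append(port_channel)
            else:
                # Behaviour proposed in this PR: skip the peer without failing.
                skipped_port_channels.append(port_channel)
    return failed_port_channels, skipped_port_channels, warnings


def probe_exit_code(failed_port_channels):
    # Exit codes as reported in this PR: 2 when any port channel failed, else 0.
    return 2 if failed_port_channels else 0


# Illustrative data only; the descr strings are made up for this sketch.
peers = {
    "PortChannel101": {"name": "ARISTA01T1", "descr": "Arista Networks EOS"},
    "PortChannel102": {"name": "PEER02T1", "descr": "SONiC Software Version: example"},
}
failed, skipped, warnings = classify_peers(peers)
print(probe_exit_code(failed), skipped)
```

Passing `fail_on_non_sonic=True` reproduces the earlier behaviour, in which the same input yields a non-empty failed list and exit code 2.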

How to verify it

Run the teamd probe on a T0 topology with non-SONiC T1 peers (e.g., Arista cEOS).

Expected behavior: Script logs warnings for non-SONiC peers and exits with code 0. Previously, it would append to failedPortChannels and exit with code 2, blocking warm-reboot.

Manual probe test:

admin@sonic:~$ sudo /usr/local/bin/teamd_increase_retry_count.py --probe-only
admin@sonic:~$ echo $?
0
admin@sonic:~$ sudo journalctl --since "1 minute ago" | grep -i "Peer device"
Feb 25 08:28:33 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA01T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:28:33 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA02T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:28:33 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA03T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:28:34 sonic teamd_increase_retry_count.py[30253]: WARNING: Peer device ARISTA04T1 is not a SONiC device; skipping teamd retry count probe

Warm-reboot test (capture logs):

admin@sonic:~$ sudo journalctl -f -t teamd_increase_retry_count.py > /home/admin/teamd_probe_logs.txt &
admin@sonic:~$ sudo warm-reboot
...
admin@sonic:~$ cat teamd_probe_logs.txt
Feb 25 08:31:38 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA01T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:31:39 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA02T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:31:39 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA03T1 is not a SONiC device; skipping teamd retry count probe
Feb 25 08:31:39 sonic teamd_increase_retry_count.py[33915]: WARNING: Peer device ARISTA04T1 is not a SONiC device; skipping teamd retry count probe

Previous command output (if the output of a command-line utility has changed)

admin@sonic:~$ sudo warm-reboot
ERROR: There are port channels/peer devices that failed the probe: ['PortChannel101', 'PortChannel102', 'PortChannel103', 'PortChannel104']

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

mattkim25 force-pushed the mkim/fix_portchannel_error-teamd_increase_retry_count branch from a06a18e to c86119b on February 25, 2026 08:46
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Fix issue where warm-reboot fails when port channels are connected to
non-SONiC peer devices (e.g., Arista cEOS, vEOS) in test topologies (e.g., sonic-mgmt)

The teamd retry count feature is SONiC-specific. This change allows warm-reboot
to proceed in mixed-vendor topologies by skipping the probe for non-SONiC devices instead of failing.

- Remove failedPortChannels.append() for non-SONiC peers
- Improve warning message to include peer device name and context
- Add comment explaining why non-SONiC peers are skipped

Signed-off-by: Matthew Kim <mkim@upscaleai.com>
mattkim25 force-pushed the mkim/fix_portchannel_error-teamd_increase_retry_count branch from c86119b to 8f5705e on February 25, 2026 08:48
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

# Don't fail the port channel for non-SONiC peers, just skip the probe
if "sonic" not in peerInfo["descr"].lower():
    log.log_warning("WARNING: Peer device is not a SONiC device; skipping")
    failedPortChannels.append(portChannel)
Contributor

This is intentional: the possible presence of non-SONiC peers is meant to be a failing condition. If the device is about to undergo a warm reboot and assumes that all peer devices are SONiC, it might incorrectly conclude that there are no non-SONiC peers and that all SONiC peers are running a sufficiently new version. That could result in a warm reboot being allowed and LAGs with non-SONiC peers going down.

Author

In my sonic-mgmt testbed, all peers are Arista VMs, so this check correctly identifies them as non-SONiC peers and fails. However, this means warm reboot will never be allowed in environments where all peers are non-SONiC, which is the case in my testbed. Is there a recommended way to handle this scenario? Or should this be gated behind a configuration option to allow warm reboot in mixed/non-SONiC environments?

Contributor

saiarcot895 Feb 27, 2026

Unless you pass the -N flag, warm reboot should still be allowed regardless of this check; it should print "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available", but not block anything. If you pass -N, then it will block warm reboot.
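The gating described in this comment can be sketched as a small decision function. This is a sketch under assumptions, not the warm-reboot script itself: the function name `warm_reboot_decision`, its parameters, and the text of the blocking message are hypothetical. It assumes that -N is the flag that makes the probe mandatory, as the start of the comment states; the warning text is the one quoted in the comment.

```python
def warm_reboot_decision(probe_exit_code, require_retry_count):
    """Return (allowed, message) for a warm reboot attempt.

    probe_exit_code: exit status of the probe (0 means no port channel failed).
    require_retry_count: True when the -N flag was passed (assumed meaning).
    """
    if probe_exit_code == 0:
        # Probe passed: nothing to warn about.
        return True, None
    if require_retry_count:
        # Hypothetical message; the real script's wording may differ.
        return False, "probe failed and -N was passed; blocking warm reboot"
    # Default path: warn, but do not block.
    return True, (
        "Warning: Retry count feature support unknown for one or more "
        "neighbor devices; assuming that it's not available"
    )


print(warm_reboot_decision(2, require_retry_count=False))
print(warm_reboot_decision(2, require_retry_count=True))
```

Under this reading, a failed probe only blocks the reboot when the caller explicitly asked for the retry count feature to be required.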

Author

When running warm reboot, the message is not Warning but Error:

ERROR: There are port channels/peer devices that failed the probe: ['PortChannel101', 'PortChannel103', 'PortChannel104', 'PortChannel102']

This does not prevent warm reboot from completing, but the error is being caught and is failing the sonic-mgmt warm reboot tests. Perhaps I should instead file a bug on the sonic-mgmt repo to ignore this message when non-SONiC peer devices are used in a T0 topology? I am curious how others are ignoring this error message when running warm reboot tests.
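One way a test harness could tolerate this message is sketched below, purely as an illustration. The names `PROBE_FAILURE_RE` and `filter_expected_errors` and the `peers_are_non_sonic` switch are hypothetical and are not part of sonic-mgmt; the pattern is built only from the error line quoted above.

```python
import re

# Matches the probe failure line quoted in this thread, with any port channel list.
PROBE_FAILURE_RE = re.compile(
    r"ERROR: There are port channels/peer devices that failed the probe: \[.*\]"
)


def filter_expected_errors(log_lines, peers_are_non_sonic):
    """Drop the probe failure line when the testbed is known to use non-SONiC peers."""
    if not peers_are_non_sonic:
        # With SONiC peers the error is unexpected, so keep every line.
        return list(log_lines)
    return [line for line in log_lines if not PROBE_FAILURE_RE.search(line)]


lines = [
    "ERROR: There are port channels/peer devices that failed the probe: "
    "['PortChannel101', 'PortChannel103', 'PortChannel104', 'PortChannel102']",
    "ERROR: some unrelated failure",
]
print(filter_expected_errors(lines, peers_are_non_sonic=True))
```

The filter is deliberately narrow: unrelated errors still reach the test, and the probe failure still counts when the peers are SONiC devices.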
