[warm-reboot] Fix probe failure with non-SONiC peers#4304
mattkim25 wants to merge 1 commit into sonic-net:master
Fix issue where warm-reboot fails when port channels are connected to non-SONiC peer devices (e.g., Arista cEOS, vEOS) in test topologies (e.g., sonic-mgmt).

The teamd retry count feature is SONiC-specific. This change allows warm-reboot to proceed in mixed-vendor topologies by skipping the probe for non-SONiC devices instead of failing.

- Remove failedPortChannels.append() for non-SONiC peers
- Improve warning message to include peer device name and context
- Add comment explaining why non-SONiC peers are skipped

Signed-off-by: Matthew Kim <mkim@upscaleai.com>
# Don't fail the port channel for non-SONiC peers, just skip the probe
if "sonic" not in peerInfo["descr"].lower():
    log.log_warning("WARNING: Peer device is not a SONiC device; skipping")
    failedPortChannels.append(portChannel)
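A minimal, self-contained sketch of the two behaviors at stake in this hunk, for comparison with the line under review. Only the `"sonic" not in peerInfo["descr"].lower()` test, the failed-port-channel list, and the idea of a warning come from the diff; the function name `classify_port_channels`, the shape of the `peers` mapping, the `fail_non_sonic` switch, and the sample descriptions are hypothetical.

```python
import logging

log = logging.getLogger("teamd_probe_sketch")


def classify_port_channels(peers, fail_non_sonic=False):
    """Split port channels into (failed, skipped) lists.

    peers maps a port channel name to a dict with a "descr" string
    (assumed here to be a description reported by the peer).
    fail_non_sonic=True models the existing behavior; False models
    the behavior proposed in this PR.
    """
    failed, skipped = [], []
    for port_channel, peer_info in peers.items():
        if "sonic" not in peer_info["descr"].lower():
            log.warning("Peer of %s is not a SONiC device; skipping probe",
                        port_channel)
            skipped.append(port_channel)
            if fail_non_sonic:
                failed.append(port_channel)
    return failed, skipped


# Hypothetical sample data: one non-SONiC peer and one SONiC peer.
peers = {
    "PortChannel101": {"descr": "Arista Networks EOS"},
    "PortChannel102": {"descr": "SONiC Software Version: SONiC.master"},
}
print(classify_port_channels(peers))
print(classify_port_channels(peers, fail_non_sonic=True))
```

With `fail_non_sonic=False` the non-SONiC peer is only reported as skipped; with `True` it also lands in the failed list, which is the condition the review comment below defends.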
This is intentional: the possible presence of non-SONiC peers is meant to be a failing condition. If the device is about to undergo a warm reboot and it assumes that all peer devices are SONiC, it might incorrectly report that there are no non-SONiC peers and that all SONiC peers are running a sufficiently new version. That could result in a warm reboot being allowed and LAGs with non-SONiC peers going down.
In my sonic-mgmt testbed, all peers are Arista VMs, so this check correctly identifies them as non-SONiC peers and fails. However, this means warm reboot will never be allowed in environments where all peers are non-SONiC, which is the case in my testbed. Is there a recommended way to handle this scenario? Or should this be gated behind a configuration option to allow warm reboot in mixed/non-SONiC environments?
Unless you pass in the -N flag, warm reboot should still be allowed regardless of this check; it should print "Warning: Retry count feature support unknown for one or more neighbor devices; assuming that it's not available", but not block anything. If you pass in -N, then it will block warm reboot.
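The gating described in this comment can be modeled as follows. This is a sketch of the described behavior only, not code from the warm-reboot script; the function name `warm_reboot_decision`, its parameters, and the "Blocked" message are hypothetical, while the warning text is copied from the comment.

```python
def warm_reboot_decision(probe_failed, require_retry_count):
    """Return (allowed, message) for a warm reboot request.

    probe_failed: True when the retry-count probe failed for any peer.
    require_retry_count: True models passing the -N flag.
    """
    if not probe_failed:
        return True, ""
    if require_retry_count:
        # Hypothetical wording; the real message is not quoted in this thread.
        return False, "Blocked: retry count feature required but probe failed"
    return True, ("Warning: Retry count feature support unknown for one or "
                  "more neighbor devices; assuming that it's not available")


for require in (False, True):
    allowed, message = warm_reboot_decision(probe_failed=True,
                                            require_retry_count=require)
    print(require, allowed, message)
```

In this model a failed probe only blocks the reboot when the requirement flag is set; otherwise it degrades to a warning.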
When running warm reboot, the message is not Warning but Error:
ERROR: There are port channels/peer devices that failed the probe: ['PortChannel101', 'PortChannel103', 'PortChannel104', 'PortChannel102']
This does not prevent warm reboot from completing, but the error is being caught and is failing the sonic-mgmt warm reboot tests. Perhaps I should file a bug on the sonic-mgmt repo instead, to ignore this message when using non-SONiC peer devices for the t0 topology? I am curious how others are ignoring this error message when running warm reboot tests.
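One way a test harness could recognize this specific message is to match it with a regular expression and extract the listed port channels. The sketch below is standalone Python and does not use sonic-mgmt's log analysis code; `PROBE_ERROR_RE` and `failed_port_channels` are hypothetical names, and the sample line is the one quoted in this comment.

```python
import ast
import re

# Matches the probe error line and captures the bracketed list at its end.
PROBE_ERROR_RE = re.compile(
    r"ERROR: There are port channels/peer devices that failed the probe: "
    r"(\[.*\])"
)


def failed_port_channels(line):
    """Return the port channel names listed in a probe error line, or None."""
    match = PROBE_ERROR_RE.search(line)
    if match is None:
        return None
    # The list is printed as a Python literal, so literal_eval can parse it.
    return ast.literal_eval(match.group(1))


line = ("ERROR: There are port channels/peer devices that failed the probe: "
        "['PortChannel101', 'PortChannel103', 'PortChannel104', "
        "'PortChannel102']")
print(failed_port_channels(line))
```

Whether such a line should be ignored is the open question in this thread; the sketch only shows how to identify it and see which port channels it names.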
What I did
Fixes issue #21996
Fix issue where warm-reboot fails when port channels are connected to non-SONiC peer devices (e.g., Arista cEOS, vEOS).
The teamd_increase_retry_count.py script was incorrectly marking port channels as failed when detecting non-SONiC peers. Since the teamd retry count feature is SONiC-specific, the script should skip the probe for non-SONiC devices rather than failing.
How I did it
- Remove failedPortChannels.append() for non-SONiC peers
How to verify it
Run the teamd probe on a T0 topology with non-SONiC T1 peers (e.g., Arista cEOS).
Expected behavior: Script logs warnings for non-SONiC peers and exits with code 0. Previously, it would append to failedPortChannels and exit with code 2, blocking warm-reboot.
Manual probe test:
Warm-reboot test (capture logs):
Previous command output (if the output of a command-line utility has changed)