
[warm-reboot][multi-asic] Added error-handling for faulty ASIC/s after orchagent freeze #4297

Open
YairRaviv wants to merge 1 commit into sonic-net:master from YairRaviv:yraviv-warm-reboot-masic-error-handling

@YairRaviv
Contributor

What I did

Aligned the error-handling logic for warm/fast reboot on multi-ASIC devices, so that an ASIC that fails after the orchagent freeze is handled explicitly.

How I did it

The FORCE variable is set to "yes" on multi-ASIC devices before the orchagents are paused.
I added a condition to the "execute_in_namespace" function that handles failures as follows:

  • If FORCE is false (before the point of no return): exit (fall back to clear_boot).
  • If FORCE is true: remove the faulty ASIC from the operational ASIC list and clear that ASIC's state.
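
For reviewers, the two branches above can be sketched roughly as follows. This is a simplified, standalone illustration and not the code in the diff: only FORCE and its "yes" value come from the description, while ASIC_LIST, CLEARED, run_on_each_asic, clear_asic_state and fail_on_asic_1 are hypothetical names, and the real "execute_in_namespace" may be structured differently.

```shell
#!/bin/sh
# Illustrative sketch only. Every name except FORCE is hypothetical.

FORCE="no"          # "yes" once the point of no return has been passed
ASIC_LIST="0 1 2"   # space-separated ids of the operational ASICs
CLEARED=""          # ids whose state was cleared (kept for illustration)

# Hypothetical stand-in for clearing the state of a single ASIC.
clear_asic_state() {
    CLEARED="${CLEARED:+$CLEARED }$1"
}

# Run "$@" once per operational ASIC, passing the ASIC id as the last argument.
# On a failure:
#   - FORCE != "yes": return 1 at once so the caller can run its fallback.
#   - FORCE == "yes": clear that ASIC's state, drop it from ASIC_LIST, continue.
run_on_each_asic() {
    _healthy=""
    for _asic in $ASIC_LIST; do
        if "$@" "$_asic"; then
            _healthy="${_healthy:+$_healthy }$_asic"
        elif [ "$FORCE" = "yes" ]; then
            clear_asic_state "$_asic"
        else
            return 1
        fi
    done
    ASIC_LIST="$_healthy"
}

# Example per-ASIC command that fails only for ASIC 1.
fail_on_asic_1() {
    [ "$1" != "1" ]
}
```

In this sketch, calling `run_on_each_asic fail_on_asic_1` with FORCE="no" returns non-zero and leaves ASIC_LIST untouched, which corresponds to the fallback path. With FORCE="yes" the same call succeeds, and ASIC 1 is removed from ASIC_LIST after its state is cleared.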

How to verify it

Tested on a multi-ASIC simulation by manually injecting failures and verifying the resulting behavior.

Previous command output (if the output of a command-line utility has changed)

New command output (if the output of a command-line utility has changed)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

…r orchagent freeze

Signed-off-by: Yair Raviv <yraviv@nvidia.com>
@YairRaviv YairRaviv force-pushed the yraviv-warm-reboot-masic-error-handling branch from f1f05f9 to fd34414 Compare February 22, 2026 14:50
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@YairRaviv YairRaviv marked this pull request as ready for review February 23, 2026 06:43