[ssw][ha] add doc for dpu restart #2175

zjswhhh · 2026-01-07T23:55:13Z

When a DPU restarts (whether planned or unexpected), the DPU_[APPL|STATE]_DB instances hosted on the NPU are not restarted along with the DPU. This leaves some state out of sync and leads to unexpected behavior in HA scenarios.

Adding doc for solution proposal. Pending review.

sign-off: Jing Zhang [email protected]

mssonicbld · 2026-01-07T23:55:21Z

/azp run

azure-pipelines · 2026-01-07T23:55:27Z

No pipelines are associated with this pull request.

doc/smart-switch/high-availability/dpu_reboot.md

mssonicbld · 2026-01-13T18:57:53Z

/azp run

azure-pipelines · 2026-01-13T18:58:00Z

No pipelines are associated with this pull request.

mssonicbld · 2026-01-15T21:15:30Z

/azp run

azure-pipelines · 2026-01-15T21:15:36Z

No pipelines are associated with this pull request.

mssonicbld · 2026-01-26T05:22:38Z

/azp run

azure-pipelines · 2026-01-26T05:22:44Z

No pipelines are associated with this pull request.

Signed-off-by: Jing Zhang <[email protected]>

* tsc/frr/sonic_frr_update_process.md

Signed-off-by: Eddie Ruan <[email protected]>

* Update CVE link

Signed-off-by: Eddie Ruan <[email protected]>

---------

Signed-off-by: Eddie Ruan <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-26T05:22:58Z

/azp run

azure-pipelines · 2026-01-26T05:23:04Z

No pipelines are associated with this pull request.

tsc/frr/sonic_frr_update_process.md

doc/smart-switch/high-availability/dpu_reboot.md

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-26T20:09:00Z

/azp run

azure-pipelines · 2026-01-26T20:09:07Z

No pipelines are associated with this pull request.

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-27T17:48:31Z

/azp run

azure-pipelines · 2026-01-27T17:48:39Z

No pipelines are associated with this pull request.

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-27T20:43:08Z

/azp run

azure-pipelines · 2026-01-27T20:43:15Z

No pipelines are associated with this pull request.

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-27T20:46:58Z

/azp run

azure-pipelines · 2026-01-27T20:47:05Z

No pipelines are associated with this pull request.

yue-fred-gao · 2026-01-27T21:40:09Z

doc/smart-switch/high-availability/dpu_reboot.md

+## Solutions
+The following are the proposed actions during DPU reboot.
+
+1. Cleanup `DPU_*_DB` instances when DPU boots up.


Cleanup happens when swss is restarted. This covers both DPU reboot and critical process restarts.

yue-fred-gao · 2026-01-27T21:40:41Z

doc/smart-switch/high-availability/dpu_reboot.md

+
+1. Cleanup `DPU_*_DB` instances when DPU boots up.
+1. SDN controller needs to monitor NPU `STATE_DB` entries below: 
+    1. `CHASSIS_MODULE_TABLE|DPU<dpu_index>: {'admin_status': 'up|down', 'oper_status': 'up|down'}`  


should this be removed?

This is still required to confirm DPU is ready to be configured.

yue-fred-gao · 2026-01-27T21:42:48Z

doc/smart-switch/high-availability/dpu_reboot.md

+        This field indicates if dpu is ready to be provisioned. Hamgrd will set the value to `false` when `dpu_control_plane_state` or `dpu_midplane_link_state` goes down, and set the value back to `true` when the states are up. 
+
+1. SDN controller will then   
+    1. Delete stale HA_SET_CONFIG and HA_SCOPE_CONFIG  


Please add the timing, when reset_status changes to true

Included already in last bullet point.

yue-fred-gao · 2026-01-27T21:43:28Z

doc/smart-switch/high-availability/dpu_reboot.md

+
+1. SDN controller will then   
+    1. Delete stale HA_SET_CONFIG and HA_SCOPE_CONFIG  
+    1. Re-program DASH objects, HA_SET_CONFIG and HA_SCOPE_CONFIG  


add condition when dpu_ready changes to true

Included already in last bullet point.

yue-fred-gao · 2026-01-27T21:44:45Z

doc/smart-switch/high-availability/dpu_reboot.md

+    1. Delete stale HA_SET_CONFIG and HA_SCOPE_CONFIG  
+    1. Re-program DASH objects, HA_SET_CONFIG and HA_SCOPE_CONFIG  
+    1. Once services are provisioned, HA is programmed, SDN controller will set `reset_status` to `false` 
+1. Hamgrd needs to change the passive BFD session creation logic, today the sessions are created statically. We need this change to avoid hamgrd restart. Hamgrd needs to create BFD session if `dpu_control_plane_state` changes from down to up. 


or dpu_midplane_state from down to up

Checked Dylan's implementation; should both states be up?

Updated the wording a bit to avoid confusion.
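
For reference, a minimal sketch of the transition check discussed in this thread, assuming the reading that passive BFD sessions are created only when both `dpu_control_plane_state` and `dpu_midplane_link_state` are up after at least one of them was down. The function name and the dict-based state snapshots are hypothetical and are not taken from hamgrd.

```python
def should_create_bfd_sessions(prev: dict, curr: dict) -> bool:
    """Return True on the transition into the 'both states up' condition.

    `prev` and `curr` are hypothetical snapshots of the DPU state fields.
    """
    def both_up(state: dict) -> bool:
        return (
            state.get("dpu_control_plane_state") == "up"
            and state.get("dpu_midplane_link_state") == "up"
        )

    # Fire only on the edge, so sessions are not re-created on every update.
    return both_up(curr) and not both_up(prev)
```

Under this reading, a midplane recovery alone does not trigger session creation while the control plane is still down.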

yue-fred-gao · 2026-01-27T21:47:33Z

doc/smart-switch/high-availability/dpu_reboot.md

+    hamgrd->>DPU_APPL_DB: 13. Create BFD passive sessions. 
+    hamgrd->>NPU STATE_DB: 14. DPU_RESET_INFO|DPU0: {"dpu_ready": "true"}
+
+    NPU STATE_DB->>SDN Controller: 15. DPU_RESET_INFO|DPU0: {"dpu_ready": "true"} && CHASSIS_MODULE_TABLE|DPU0 {'admin_status':'up'} &&CHASSIS_MODULE_TABLE|DPU0 {'oper_status':'up'} 


Did you see that the SDN controller should not access CHASSIS_MODULE_TABLE?

SDN controller will not use CHASSIS_STATE_DB. CHASSIS_MODULE_TABLE is in NPU's STATE_DB.
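
A minimal sketch of the condition in step 15 of the quoted diagram, assuming the controller has already read the two NPU `STATE_DB` entries into plain dicts. The function name is hypothetical, and the field names follow the quoted diff at this revision, which later commits renamed.

```python
def dpu_ready_for_provisioning(reset_info: dict, chassis_module: dict) -> bool:
    """Return True only when all three conditions from step 15 hold.

    reset_info:     fields of the DPU_RESET_INFO|DPU<n> entry (as quoted above)
    chassis_module: fields of the CHASSIS_MODULE_TABLE|DPU<n> entry
    """
    return (
        reset_info.get("dpu_ready") == "true"
        and chassis_module.get("admin_status") == "up"
        and chassis_module.get("oper_status") == "up"
    )
```

A missing field is treated as "not ready", so the controller would wait rather than provision against an incomplete entry.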

yue-fred-gao · 2026-01-27T22:14:17Z

doc/smart-switch/high-availability/dpu_reboot.md

+
+
+    SDN Controller->>hamgrd: 16. Create DASH objects, HA_SCOPE_CONFIG and HA_SET_CONFIG 
+    SDN Controller->>NPU STATE_DB: 17. DPU_RESET_INFO|DPU0: {"reset_status": "false"}


If the DPU crashes before reaching step 17, hamgrd will set reset_status to true; it is already true, but the timestamp will be new. The controller then needs to restart the reset process (delete HA scope/HA set). I still think we should converge on dpu_ready so the controller only needs to watch this field: if it is false, start the reset process by removing the HA config; when it is true, add the HA config, etc.

Yes, step 17 is actually to write the DB entry. Updated for clarity.

I think we need to call out explicitly to the controller that the DPU may crash again before the controller sets reset_status to false. The controller therefore needs to track the status's last update time to detect this situation; otherwise, the DPU will be stuck in a bad state.
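
A minimal sketch of the timestamp tracking described in this thread, assuming the controller receives each `reset_status` update together with its last-update timestamp as an integer. The class name, method name, and returned action strings are hypothetical; this is one way to detect a repeated crash, not the controller's actual logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResetTracker:
    # Timestamp of the reset event currently being handled, if any.
    handled_update: Optional[int] = None

    def on_reset_info(self, reset_status: str, last_update: int) -> str:
        """Return 'restart', 'continue', or 'idle' for one observed update."""
        if reset_status != "true":
            # Reset finished (the controller wrote "false"); nothing pending.
            self.handled_update = None
            return "idle"
        if self.handled_update is None or last_update > self.handled_update:
            # New reset, or the DPU crashed again mid-recovery: the value is
            # still "true" but the timestamp moved, so start over by
            # deleting the stale HA config.
            self.handled_update = last_update
            return "restart"
        # Same reset event as before; keep provisioning.
        return "continue"
```

The key point is that the value `true` alone cannot distinguish a second crash from the first, so the comparison is on the timestamp.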

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-28T01:46:27Z

/azp run

azure-pipelines · 2026-01-28T01:46:33Z

No pipelines are associated with this pull request.

mssonicbld · 2026-01-28T01:51:04Z

/azp run

azure-pipelines · 2026-01-28T01:51:10Z

No pipelines are associated with this pull request.

yue-fred-gao · 2026-01-28T14:26:46Z

doc/smart-switch/high-availability/dpu_reboot.md

+1. SDN controller needs to monitor NPU `STATE_DB` entries below: 
+    1. `CHASSIS_MODULE_TABLE|DPU<dpu_index>: {'admin_status': 'up|down', 'oper_status': 'up|down'}`  
+        This table entry will be updated by [SmartSwitch pmon](https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/pmon/smartswitch-pmon.md).
+    1. `DASH_DPU_RESET_INFO_TABLE|DPU<dpu_index>: {'reset_status': 'true|false', 'last_reset_status_update': timestamp}`  


Are you going to change the key to vDPU id?

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-28T19:00:39Z

/azp run

azure-pipelines · 2026-01-28T19:00:46Z

No pipelines are associated with this pull request.

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-28T19:57:27Z

/azp run

azure-pipelines · 2026-01-28T19:57:34Z

No pipelines are associated with this pull request.

prsunny · 2026-01-29T00:23:01Z

Please add PRs to description following template

Signed-off-by: Jing Zhang <[email protected]>

mssonicbld · 2026-01-30T20:05:05Z

/azp run

azure-pipelines · 2026-01-30T20:05:11Z

No pipelines are associated with this pull request.

…wn (#137)

**What I did**
When the DPU midplane or control plane goes down, write DASH_DPU_RESET_INFO_TABLE.

**Why I did it**
This handles DPU reboot and DPU critical process restart. The table is used to notify the SDN controller to take action. For details, see sonic-net/SONiC#2175.

**How I verified it**
Verified on hardware. When the DPU is rebooted, or a critical process is restarted, the entry below is created or updated:

root@MtFuji-dut02:/home/cisco# sonic-db-cli STATE_DB hgetall DASH_DPU_RESET_INFO\|DPU0
{'timestamp': '1769715782213', 'dpu_id': 'dpu1_0', 'vdpu_id': 'vdpu1_0', 'reset_status': 'true'}

**Details if related**

---------

Signed-off-by: Yue Gao <[email protected]>
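
The entry in the verification output above could be interpreted as in the following sketch. It assumes that `timestamp` is milliseconds since the Unix epoch, which the commit message does not state but which is consistent with the PR dates. The function name and the returned keys are hypothetical.

```python
from datetime import datetime, timezone


def parse_reset_info(entry: dict) -> dict:
    """Convert the string fields of a reset-info entry into typed values."""
    return {
        "dpu_id": entry["dpu_id"],
        "vdpu_id": entry["vdpu_id"],
        "reset": entry["reset_status"] == "true",
        # Assumption: the timestamp is epoch milliseconds.
        "updated_at": datetime.fromtimestamp(
            int(entry["timestamp"]) / 1000, tz=timezone.utc
        ),
    }


entry = {
    "timestamp": "1769715782213",
    "dpu_id": "dpu1_0",
    "vdpu_id": "vdpu1_0",
    "reset_status": "true",
}
info = parse_reset_info(entry)
```

Under that assumption, `info["updated_at"]` falls on 2026-01-29 (UTC), the day before the commit was pushed.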

zjswhhh requested review from prsunny and yue-fred-gao January 7, 2026 23:56

yue-fred-gao reviewed Jan 8, 2026

doc/smart-switch/high-availability/dpu_reboot.md

zjswhhh mentioned this pull request Jan 26, 2026

[ssw] clean up DPU_APPL_DB and DPU_STATE_DB for DPU swss restart or DPU reboot sonic-net/sonic-buildimage#25187

Open

8 tasks

zjswhhh and others added 6 commits January 26, 2026 05:22

..

d12076f

Signed-off-by: Jing Zhang <[email protected]>

..

3a87bc2

Signed-off-by: Jing Zhang <[email protected]>

..

5cff490

Signed-off-by: Jing Zhang <[email protected]>

address commetns

fc67ab7

Signed-off-by: Jing Zhang <[email protected]>

address comments

83485ac

Signed-off-by: Jing Zhang <[email protected]>

zjswhhh force-pushed the share branch from 91fc117 to 7adee42 Compare January 26, 2026 05:22

zjswhhh requested review from r12f and vivekrnv January 26, 2026 05:24

prsunny reviewed Jan 26, 2026

tsc/frr/sonic_frr_update_process.md

prsunny reviewed Jan 26, 2026

doc/smart-switch/high-availability/dpu_reboot.md

prsunny reviewed Jan 26, 2026

doc/smart-switch/high-availability/dpu_reboot.md

prsunny reviewed Jan 26, 2026

doc/smart-switch/high-availability/dpu_reboot.md

update per offline

83fd937

Signed-off-by: Jing Zhang <[email protected]>

udpat4e

4f8ff91

Signed-off-by: Jing Zhang <[email protected]>

per offline

1b6d9ce

Signed-off-by: Jing Zhang <[email protected]>

typo

61feba2

Signed-off-by: Jing Zhang <[email protected]>

yue-fred-gao reviewed Jan 27, 2026

zjswhhh mentioned this pull request Jan 28, 2026

[ssw][ha] add new table DPU_RESET_INFO sonic-net/sonic-dash-api#55

Closed

minor update

155ce92

Signed-off-by: Jing Zhang <[email protected]>

yue-fred-gao reviewed Jan 28, 2026

udpate table name

74cb003

Signed-off-by: Jing Zhang <[email protected]>

zjswhhh force-pushed the share branch from 88f8614 to 74cb003 Compare January 28, 2026 19:00

udpate fields

8c89156

Signed-off-by: Jing Zhang <[email protected]>

yue-fred-gao mentioned this pull request Jan 28, 2026

Write DASH_DPU_RESET_INFO_TABLE when dpu midplane or control plane down sonic-net/sonic-dash-ha#137

Merged

remove dash in the table name

abf7af0

Signed-off-by: Jing Zhang <[email protected]>



