Contributor

@zjswhhh zjswhhh commented Jan 7, 2026

When a DPU restarts (whether planned or unexpected), the DPU_[APPL|STATE]_DB instances hosted on the NPU are not restarted along with the DPU. This leaves some state out of sync and leads to unexpected behavior in HA scenarios.

Adding doc for solution proposal. Pending review.

sign-off: Jing Zhang [email protected]

@mssonicbld
Collaborator

/azp run

@azure-pipelines

No pipelines are associated with this pull request.

zjswhhh and others added 6 commits January 26, 2026 05:22
Signed-off-by: Jing Zhang <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>
* tsc/frr/sonic_frr_update_process.md

Signed-off-by: Eddie Ruan <[email protected]>

* Update CVE link

Signed-off-by: Eddie Ruan <[email protected]>

---------

Signed-off-by: Eddie Ruan <[email protected]>
Signed-off-by: Jing Zhang <[email protected]>

@zjswhhh zjswhhh requested review from r12f and vivekrnv January 26, 2026 05:24
Signed-off-by: Jing Zhang <[email protected]>

Signed-off-by: Jing Zhang <[email protected]>

Signed-off-by: Jing Zhang <[email protected]>

Signed-off-by: Jing Zhang <[email protected]>

## Solutions
The following are the proposed actions during DPU reboot.

1. Cleanup `DPU_*_DB` instances when DPU boots up.

Cleanup happens when swss is restarted. It covers both DPU reboot and critical process restarts.

Contributor Author

Done.


1. Cleanup `DPU_*_DB` instances when DPU boots up.
1. SDN controller needs to monitor NPU `STATE_DB` entries below:
1. `CHASSIS_MODULE_TABLE|DPU<dpu_index>: {'admin_status': 'up|down', 'oper_status': 'up|down'}`

Should this be removed?

Contributor Author

This is still required to confirm DPU is ready to be configured.

This field indicates if dpu is ready to be provisioned. Hamgrd will set the value to `false` when `dpu_control_plane_state` or `dpu_midplane_link_state` goes down, and set the value back to `true` when the states are up.
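
As a rough illustration of the rule described above, the `dpu_ready` value can be derived from the two monitored states. This is a minimal sketch, not hamgrd's actual code: the `DpuState` class and `compute_dpu_ready` function are hypothetical names, and only the two state names and the `true`/`false` values come from the text.

```python
from dataclasses import dataclass


@dataclass
class DpuState:
    # Hypothetical container; the field names mirror the states named in the doc.
    dpu_control_plane_state: str  # "up" or "down"
    dpu_midplane_link_state: str  # "up" or "down"


def compute_dpu_ready(state: DpuState) -> str:
    """Return "true" only when both monitored states are up, otherwise "false"."""
    both_up = (
        state.dpu_control_plane_state == "up"
        and state.dpu_midplane_link_state == "up"
    )
    return "true" if both_up else "false"
```

With this rule, either state going down is enough to flip the field to `false`, and it only returns to `true` once both are up again.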

1. SDN controller will then
1. Delete stale HA_SET_CONFIG and HA_SCOPE_CONFIG

Please add the timing of when `reset_status` changes to true.

Contributor Author

Already included in the last bullet point.


1. SDN controller will then
1. Delete stale HA_SET_CONFIG and HA_SCOPE_CONFIG
1. Re-program DASH objects, HA_SET_CONFIG and HA_SCOPE_CONFIG

Add the condition for when `dpu_ready` changes to true.

Contributor Author

Already included in the last bullet point.

1. Delete stale HA_SET_CONFIG and HA_SCOPE_CONFIG
1. Re-program DASH objects, HA_SET_CONFIG and HA_SCOPE_CONFIG
1. Once services are provisioned, HA is programmed, SDN controller will set `reset_status` to `false`
1. Hamgrd needs to change the passive BFD session creation logic, today the sessions are created statically. We need this change to avoid hamgrd restart. Hamgrd needs to create BFD session if `dpu_control_plane_state` changes from down to up.

Or when `dpu_midplane_state` changes from down to up.

Contributor Author

I checked Dylan's implementation; both states should be up?

Updated the wording a bit to avoid confusion.
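
One way to picture the trigger discussed in this thread is an edge detector that fires only when the DPU moves from "not both up" to "both up". This is a sketch under that assumption, which the reply above poses as a question rather than confirms. `BfdSessionTrigger` and `on_state_update` are hypothetical names, not hamgrd APIs.

```python
class BfdSessionTrigger:
    """Signals when passive BFD sessions should be (re)created.

    Assumption: sessions are created on the transition to both the control
    plane and the midplane being up, not on every state update.
    """

    def __init__(self) -> None:
        self._was_ready = False  # last observed "both up" condition

    def on_state_update(self, control_plane_state: str, midplane_state: str) -> bool:
        is_ready = control_plane_state == "up" and midplane_state == "up"
        # Fire only on the rising edge so repeated "up" updates do not
        # create duplicate sessions.
        should_create = is_ready and not self._was_ready
        self._was_ready = is_ready
        return should_create
```

Because the trigger keeps the previous condition, a DPU that goes down and comes back up produces a new creation event without restarting the process that hosts the trigger.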

hamgrd->>DPU_APPL_DB: 13. Create BFD passive sessions.
hamgrd->>NPU STATE_DB: 14. DPU_RESET_INFO|DPU0: {"dpu_ready": "true"}

NPU STATE_DB->>SDN Controller: 15. DPU_RESET_INFO|DPU0: {"dpu_ready": "true"} && CHASSIS_MODULE_TABLE|DPU0 {'admin_status':'up'} &&CHASSIS_MODULE_TABLE|DPU0 {'oper_status':'up'}

Did you see that the SDN controller should not access CHASSIS_MODULE_TABLE?

Contributor Author

SDN controller will not use CHASSIS_STATE_DB. CHASSIS_MODULE_TABLE is in NPU's STATE_DB.
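
The condition in step 15 of the sequence diagram can be read as a simple predicate over two NPU `STATE_DB` entries. The sketch below assumes the entries have already been read into plain dictionaries; `dpu_ready_for_config` is a hypothetical helper, and how the controller actually fetches the entries is outside its scope.

```python
def dpu_ready_for_config(reset_info: dict, chassis_module: dict) -> bool:
    """Check the step 15 condition from the sequence diagram.

    reset_info:     fields of DPU_RESET_INFO|DPU<dpu_index>
    chassis_module: fields of CHASSIS_MODULE_TABLE|DPU<dpu_index>
    Missing fields are treated as "not ready".
    """
    return (
        reset_info.get("dpu_ready") == "true"
        and chassis_module.get("admin_status") == "up"
        and chassis_module.get("oper_status") == "up"
    )
```

For example, `dpu_ready_for_config({"dpu_ready": "true"}, {"admin_status": "up", "oper_status": "down"})` evaluates to `False`, so the controller would hold off on step 16.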



SDN Controller->>hamgrd: 16. Create DASH objects, HA_SCOPE_CONFIG and HA_SET_CONFIG
SDN Controller->>NPU STATE_DB: 17. DPU_RESET_INFO|DPU0: {"reset_status": "false"}

If the DPU crashes before reaching step 17, hamgrd will set reset_status to true; it is already true, but the timestamp will be new. The controller then needs to restart the reset process (delete the HA scope and HA set). I still think we should converge on dpu_ready so the controller only needs to watch this one field: if it is false, start the reset process by removing the HA config; when it is true, add the HA config, and so on.

Contributor Author

Yes, step 17 is actually to write the DB entry. Updated for clarity.

I think we need to call out explicitly to the controller that the DPU may crash again before the controller sets reset_status to false. The controller therefore needs to track the status's last update time to detect this situation; otherwise, the DPU will be stuck in a bad state.
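
The tracking described in this comment could look roughly like the sketch below. It is an illustration only: `ResetTracker` and the returned action names are hypothetical, and the timestamp is assumed to be a monotonically increasing integer taken from whichever field carries the last update time.

```python
from typing import Optional


class ResetTracker:
    """Controller-side view of one DPU's reset entry (hypothetical sketch)."""

    def __init__(self) -> None:
        # Timestamp of the reset the controller is currently handling, if any.
        self._handled_ts: Optional[int] = None

    def observe(self, reset_status: str, timestamp: int) -> str:
        if reset_status != "true":
            self._handled_ts = None
            return "idle"
        if self._handled_ts is None:
            self._handled_ts = timestamp
            return "start_reset"
        if timestamp > self._handled_ts:
            # reset_status was already true, but the entry was rewritten:
            # the DPU went down again before the controller finished.
            self._handled_ts = timestamp
            return "restart_reset"
        return "in_progress"
```

Comparing timestamps is what distinguishes a second crash from a repeated notification of the same reset, since `reset_status` reads `true` in both cases.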

Signed-off-by: Jing Zhang <[email protected]>

1. SDN controller needs to monitor NPU `STATE_DB` entries below:
1. `CHASSIS_MODULE_TABLE|DPU<dpu_index>: {'admin_status': 'up|down', 'oper_status': 'up|down'}`
This table entry will be updated by [SmartSwitch pmon](https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/pmon/smartswitch-pmon.md).
1. `DASH_DPU_RESET_INFO_TABLE|DPU<dpu_index>: {'reset_status': 'true|false', 'last_reset_status_update': timestamp}`

Are you going to change the key to vDPU id?

Signed-off-by: Jing Zhang <[email protected]>

Signed-off-by: Jing Zhang <[email protected]>

Contributor

prsunny commented Jan 29, 2026

Please add the related PRs to the description, following the template.

zjswhhh pushed a commit to sonic-net/sonic-dash-ha that referenced this pull request Feb 1, 2026
…wn (#137)

**What I did**
When the DPU midplane or control plane goes down, write to DASH_DPU_RESET_INFO_TABLE.
**Why I did it**
This handles DPU reboot and DPU critical process restart. The table is used to notify the SDN controller to take action. For details, see sonic-net/SONiC#2175.
**How I verified it**
Verified on hardware. When the DPU is rebooted or a critical process is restarted, the entry below is created or updated:
root@MtFuji-dut02:/home/cisco# sonic-db-cli STATE_DB hgetall DASH_DPU_RESET_INFO\|DPU0
{'timestamp': '1769715782213', 'dpu_id': 'dpu1_0', 'vdpu_id': 'vdpu1_0', 'reset_status': 'true'}
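
For readers who want to interpret the entry above, here is a small sketch that converts the `timestamp` field to a UTC time. It assumes the value is milliseconds since the Unix epoch, which the 13-digit value suggests but the commit message does not state. `reset_time_utc` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Entry copied from the verification output above.
entry = {
    "timestamp": "1769715782213",
    "dpu_id": "dpu1_0",
    "vdpu_id": "vdpu1_0",
    "reset_status": "true",
}


def reset_time_utc(fields: dict) -> datetime:
    """Convert the entry's timestamp to an aware UTC datetime.

    Assumption: 'timestamp' is milliseconds since the Unix epoch.
    """
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(milliseconds=int(fields["timestamp"]))


print(reset_time_utc(entry).isoformat())
```

Under the millisecond assumption, the entry dates to 29 January 2026 (UTC).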

**Details if related**

---------

Signed-off-by: Yue Gao <[email protected]>