Add health check for Net Devices #640

wizhaoredhat · 2025-04-09T00:07:02Z

The implementation checks:

1. Whether the physical function (PF) of the SR-IOV devices is carrier down. If so, the devices should be marked unhealthy. Normally, SR-IOV still functions when the PF is carrier down. But for DPUs/IPUs/SmartNICs with an embedded CPU, a carrier-down PF can signal that the embedded CPU is in reset or shut down.
2. Whether any of the devices are gone. This could be due to someone changing the number of virtual functions. Or, for DPUs/IPUs/SmartNICs with an embedded CPU, the driver may have needed to reset, which causes the virtual functions to be removed. All devices that are gone should be marked unhealthy. Normally this won't happen, since the SR-IOV Network Operator manages the SR-IOV devices. However, DPUs/IPUs/SmartNICs with an embedded CPU would be managed externally by a separate operator.

Both checks can be switched on and off using checkHealthOnPf and
checkHealthOnDeviceExist within the resource config.
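
As a rough illustration of how the two switches might be read from a resource config entry, here is a minimal sketch. It assumes the flags are plain JSON booleans named exactly as in the description; the struct name `resourceHealthFlags`, the helper `parseHealthFlags`, and the resource name `example_vfs` are hypothetical and not taken from the PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceHealthFlags is a hypothetical struct that mirrors only the two
// switches named in the PR description, plus a resource name for context.
type resourceHealthFlags struct {
	ResourceName             string `json:"resourceName"`
	CheckHealthOnPf          bool   `json:"checkHealthOnPf"`
	CheckHealthOnDeviceExist bool   `json:"checkHealthOnDeviceExist"`
}

// parseHealthFlags decodes one resource entry. Flags that are absent keep
// Go's zero value (false), so both checks are off unless requested.
func parseHealthFlags(raw string) (resourceHealthFlags, error) {
	var rc resourceHealthFlags
	err := json.Unmarshal([]byte(raw), &rc)
	return rc, err
}

func main() {
	// "example_vfs" is a made-up resource name used only for this sketch.
	raw := `{"resourceName": "example_vfs", "checkHealthOnPf": true}`
	rc, err := parseHealthFlags(raw)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: checkHealthOnPf=%v checkHealthOnDeviceExist=%v\n",
		rc.ResourceName, rc.CheckHealthOnPf, rc.CheckHealthOnDeviceExist)
}
```

Leaving a flag out keeps the corresponding check disabled in this sketch, which matches the "false" default documented for checkHealthOnPf in the README change.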

wizhaoredhat · 2025-04-09T00:15:17Z

@zeeke / @SchSeba PTAL

coveralls · 2025-04-09T09:34:56Z

Pull Request Test Coverage Report for Build 15006985984

Details

51 of 75 (68.0%) changed or added relevant lines in 6 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage decreased (-0.2%) to 74.306%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/devices/gen_pci.go | 0 | 3 | 0.0% |
| pkg/devices/gen_net.go | 0 | 6 | 0.0% |
| pkg/netdevice/netResourcePool.go | 47 | 53 | 88.68% |
| pkg/devices/api.go | 0 | 9 | 0.0% |

Totals
Change from base Build 14837814551:	-0.2%
Covered Lines:	2169
Relevant Lines:	2919

💛 - Coveralls

zeeke

Left a few minor comments

pkg/netdevice/netResourcePool_test.go

wizhaoredhat · 2025-04-09T21:12:01Z

@zeeke Please take a look at the changes again.

wizhaoredhat · 2025-04-10T13:09:06Z

@adrianchiris PTAL

SchSeba · 2025-04-23T12:42:46Z

Hi @wizhaoredhat,
please check the failed CI.

SchSeba

Overall looks good; please check the two small comments.

A follow-up question: where are we going to use this feature? Does it need to be part of the sriov-network-operator, or of a different operator that deploys a device plugin?

pkg/netdevice/netResourcePool.go

SchSeba · 2025-04-23T13:17:50Z

pkg/netdevice/netResourcePool.go

+			}
+		}
+
+		if pfIsUp && deviceExists && !currentHealth {


Do you think we can make this long if-else statement simpler? It took me some time to understand it.

Please take a look now at the reduced if statement. Let me know if it is better.
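
One way such a branch can be flattened, shown here only as a sketch and not as the code that landed in the PR, is to compute the desired health once and compare it with the current value. The names `healthInputs`, `wantHealthy`, and `nextHealth` are hypothetical.

```go
package main

import "fmt"

// healthInputs is a hypothetical bundle of the two config switches and the
// two observations discussed in this PR.
type healthInputs struct {
	checkPf      bool // checkHealthOnPf is enabled
	checkExist   bool // checkHealthOnDeviceExist is enabled
	pfIsUp       bool // PF link was observed up
	deviceExists bool // device is still present
}

// wantHealthy returns the health a device should have. A disabled check
// never marks the device unhealthy.
func wantHealthy(in healthInputs) bool {
	if in.checkPf && !in.pfIsUp {
		return false
	}
	if in.checkExist && !in.deviceExists {
		return false
	}
	return true
}

// nextHealth returns the new health and whether it differs from the current
// one, so a caller can signal an update only on transitions.
func nextHealth(current bool, in healthInputs) (health, changed bool) {
	want := wantHealthy(in)
	return want, want != current
}

func main() {
	in := healthInputs{checkPf: true, checkExist: true, pfIsUp: false, deviceExists: true}
	health, changed := nextHealth(true, in)
	fmt.Println("health:", health, "changed:", changed)
}
```

Both directions (healthy to unhealthy and back) go through the same comparison, so no separate branch like `pfIsUp && deviceExists && !currentHealth` is needed in this formulation.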

wizhaoredhat · 2025-05-13T21:31:26Z

> Overall looks good; please check the two small comments.
>
> A follow-up question: where are we going to use this feature? Does it need to be part of the sriov-network-operator, or of a different operator that deploys a device plugin?

Some operators use the device plugin together with the SR-IOV Network Operator; I think some NVIDIA operators do this. For the DPU Operator, I plan to include this change and reuse the SR-IOV device plugin. DPUs and SmartNICs would benefit from this change, since they can be reset, rebooted, or shut down asynchronously.

One thing I dislike is that Kubernetes doesn't put the pod into a failed state when the underlying allocated device becomes unhealthy. I think the expectation in k8s is to use liveness probes. I plan to dig a little more into this.

I think the flags now included in the ConfigMap could be exposed through the SR-IOV Network Operator API after this change has been merged, perhaps in the SriovNetworkNodePolicy? Just wondering whether you think this would be useful for the SR-IOV Network Operator.

Copilot

Pull Request Overview

This PR introduces a health check mechanism for net devices by evaluating both the physical function (PF) link status and the existence of PCI devices, and it updates several mock implementations and configuration structs to support this functionality.

- Renames and updates function calls in utils.go (deviceExist → DeviceExist)
- Enhances resource configuration and interfaces in types.go with health check flags and methods (CheckHealthOnPf and CheckHealthOnDeviceExist)
- Adds a new Probe function in netResourcePool.go and corresponding tests in netdevice/netResourcePool_test.go, along with updates in various mocks and device implementations

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| pkg/utils/utils.go | Renamed function to public DeviceExist to reflect its intended API usage |
| pkg/types/types.go | Added new health check configuration flags and methods in device interfaces |
| pkg/types/mocks/* | Added mock implementations for GetHealth, SetHealth, and DeviceExists |
| pkg/resources/server.go | Reduced the check interval from 20 to 5 seconds |
| pkg/netdevice/netResourcePool.go | Implemented the Probe function to update device health based on PF and device existence checks |
| pkg/netdevice/netResourcePool_test.go | Added extensive tests for various health check scenarios |
| pkg/devices/gen_pci.go | Added DeviceExists method using the DeviceExist utility function |
| pkg/devices/gen_net.go | Added IsPfLinkUp method to inspect PF link status using new unix flags |
| pkg/devices/api.go | Added GetHealth and SetHealth methods for API devices |
| go.mod | Updated dependency on golang.org/x/sys |
| README.md | Updated documentation with the new health check configuration options |
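
For readers unfamiliar with PF link checks, the sketch below shows one common way to observe carrier state on Linux. It is not the PR's IsPfLinkUp, which the summary says relies on unix flags; this version swaps in a read of the sysfs `carrier` attribute and uses only the standard library. The names `readFileFunc` and `pfCarrierUp`, and the interface name `eth0`, are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// readFileFunc abstracts file reads so the check can run against a fake
// sysfs in tests. os.ReadFile satisfies it.
type readFileFunc func(path string) ([]byte, error)

// pfCarrierUp reports whether <sysfsNetDir>/<pfName>/carrier contains "1".
// On Linux, reading this attribute fails while the interface is
// administratively down; that error is returned to the caller, who decides
// how to treat it.
func pfCarrierUp(read readFileFunc, sysfsNetDir, pfName string) (bool, error) {
	data, err := read(filepath.Join(sysfsNetDir, pfName, "carrier"))
	if err != nil {
		return false, err
	}
	return strings.TrimSpace(string(data)) == "1", nil
}

func main() {
	// "eth0" is a placeholder PF name for this sketch.
	up, err := pfCarrierUp(os.ReadFile, "/sys/class/net", "eth0")
	if err != nil {
		fmt.Println("could not read carrier:", err)
		return
	}
	fmt.Println("PF carrier up:", up)
}
```

Injecting the reader keeps the function testable without real hardware, in the same spirit as the mock-based tests the PR adds.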

pkg/netdevice/netResourcePool_test.go

wizhaoredhat · 2025-06-25T20:20:02Z

@adrianchiris could you please take a look? This would be useful when the NVIDIA DPU is not managed by the DPF operator.

adrianchiris · 2025-06-30T13:51:02Z

README.md

 | "resourcePrefix"  | N        | Endpoint resource prefix name override. Should not contain special characters                                                          | string Default : "intel.com"                          | "yourcompany.com"                                                      |
 | "deviceType"      | N        | Device Type for a resource pool.                                                                                                       | string value of supported types. Default: "netDevice" | Currently supported values: "accelerator", "netDevice", "auxNetDevice" |
 | "excludeTopology" | N        | Exclude advertising of device's NUMA topology                                                                                          | bool Default: "false"                                 | "excludeTopology": true                                                |
+| "checkHealthOnPf" | N        | Check the health of a net device by inspecting the link state of the PF                                                                                       | bool Default: "false"                                 | "checkHealthOnPf": true                                                |


Should this be a list? That way we don't encumber the API.

e.g.
healthChecks: ["DeviceNetworkLink", "DeviceExists"] or similar

I also want to add something like this:

healthChecks: ["DeviceNetworkLink", "DeviceExists"], healthCheckInterval: 5,

Also, today there is no implementation of the probe; it was running a sleep loop doing nothing.
Let's change that to only start the probe goroutine if there is something in the healthChecks list.
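
The suggestion above could look roughly like the following. This is a sketch of the proposed shape, not code from the PR: the struct `healthConfig`, the helpers `probeInterval` and `startProbe`, the unit of seconds for healthCheckInterval, and the 20-second fallback (taken from the previous checkIntervals value in the diff) are all assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// healthConfig is a hypothetical config shape following the review
// suggestion: a list of check names plus an interval in seconds.
type healthConfig struct {
	HealthChecks        []string
	HealthCheckInterval int
}

// defaultInterval is assumed here; 20s was the previous checkIntervals value.
const defaultInterval = 20 * time.Second

// probeInterval reports whether probing is enabled and how often to run it.
// An empty healthChecks list disables probing entirely.
func probeInterval(cfg healthConfig) (time.Duration, bool) {
	if len(cfg.HealthChecks) == 0 {
		return 0, false
	}
	if cfg.HealthCheckInterval <= 0 {
		return defaultInterval, true
	}
	return time.Duration(cfg.HealthCheckInterval) * time.Second, true
}

// startProbe launches the probe goroutine only when at least one check is
// configured, and returns whether it was started.
func startProbe(ctx context.Context, cfg healthConfig, probe func()) bool {
	interval, enabled := probeInterval(cfg)
	if !enabled {
		return false
	}
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				probe()
			}
		}
	}()
	return true
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	fmt.Println("started with empty list:", startProbe(ctx, healthConfig{}, func() {}))
	cfg := healthConfig{HealthChecks: []string{"DeviceExists"}, HealthCheckInterval: 5}
	fmt.Println("started with one check:", startProbe(ctx, cfg, func() {}))
}
```

Making the interval configurable would also let deployments that find 5 seconds too aggressive choose a longer period.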

wizhaoredhat · 2025-07-03T15:52:48Z

pkg/resources/server.go

 		updateSignal:       make(chan bool),
 		stopWatcher:        make(chan bool),
-		checkIntervals:     20, // updates every 20 seconds
+		checkIntervals:     5, // updates every 5 seconds


Might be too short for Telco environments.

wizhaoredhat force-pushed the add_health_check branch 3 times, most recently from 4ba69ac to 6a5c22c on April 9, 2025 00:40

zeeke approved these changes Apr 9, 2025

wizhaoredhat force-pushed the add_health_check branch from 6a5c22c to c697be8 on April 9, 2025 21:10

zeeke requested review from adrianchiris and ykulazhenkov April 16, 2025 14:16

SchSeba reviewed Apr 23, 2025

wizhaoredhat added 5 commits May 13, 2025 17:18

Expose DeviceExist for use when importing utils

3aa9914

Signed-off-by: William Zhao <[email protected]>

Add GetHealth/SetHealth API for supporting DP health checks

d56d95b

Generated updated mocks with "make generate-mocks" Signed-off-by: William Zhao <[email protected]>

Add DeviceExists and IsPfLinkUp API to gen_net and gen_pci

3f71202

The APIs will be used to determine if a device is healthy or not. Generated updated mocks with "make generate-mocks" Signed-off-by: William Zhao <[email protected]>

Revise to fix lint errors and address if-statement complexity

55719bb

Signed-off-by: William Zhao <[email protected]>

wizhaoredhat force-pushed the add_health_check branch from e6bf4fd to 55719bb on May 13, 2025 21:20

SchSeba requested a review from Copilot May 27, 2025 13:10

Copilot AI reviewed May 27, 2025

Fix DeviceExist mock expectation in netResourcePool_Test

ebbed81

Signed-off-by: William Zhao <[email protected]> Co-authored-by: Copilot <[email protected]>

adrianchiris reviewed Jun 30, 2025

wizhaoredhat commented Jul 3, 2025
