Add health check for Net Devices #640
4ba69ac to 6a5c22c
Pull Request Test Coverage Report for Build 15006985984 (Coveralls)
zeeke
left a comment
Left a few minor comments
6a5c22c to c697be8
@zeeke Please take a look at the changes again.
@adrianchiris PTAL
Hi @wizhaoredhat,
SchSeba
left a comment
Overall looks good; please check the two small comments.
A follow-up question: where are we going to use this feature? Does it need to be part of the sriov-network-operator, or of a different operator that deploys a device plugin?
        }
    }

    if pfIsUp && deviceExists && !currentHealth {
Do you think we can make this long if/else statement simpler? It took me some time to understand it.
Please take a look at the reduced if statement now. Let me know if it is better.
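One common way to flatten a branching health update like the one discussed above is to compute the desired state once and act only when it differs from the current state. The sketch below is illustrative and is not the code in this PR; `nextHealth` and its parameter names are hypothetical, chosen to match the variables visible in the diff excerpt.

```go
package main

import "fmt"

// nextHealth is a hypothetical helper. It returns the health a device should
// have given the two checks, and whether that differs from its current health.
func nextHealth(pfIsUp, deviceExists, currentHealth bool) (healthy, changed bool) {
	healthy = pfIsUp && deviceExists
	return healthy, healthy != currentHealth
}

func main() {
	// A currently healthy device whose PF loses carrier.
	healthy, changed := nextHealth(false, true, true)
	if changed {
		fmt.Println("health changed, new value:", healthy)
	}
}
```

With this shape the caller needs a single `if changed` branch to update the device and signal the resource server, instead of one branch per combination of inputs.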
Signed-off-by: William Zhao <[email protected]>
Generated updated mocks with "make generate-mocks" Signed-off-by: William Zhao <[email protected]>
The APIs will be used to determine if a device is healthy or not. Generated updated mocks with "make generate-mocks" Signed-off-by: William Zhao <[email protected]>
The implementation checks: 1. If the physical function (PF) of the SR-IOV devices is carrier down. This should be marked unhealthy. Normally, SR-IOV would still function when the PF is carrier down. But in the case of DPUs/IPUs/SmartNICs with an embedded CPU, the PF being down can signal that the embedded CPU is in reset or shutdown with carrier down. 2. If any of the devices are gone. This could be due to someone changing the number of virtual functions. Or, in the case of DPUs/IPUs/SmartNICs with an embedded CPU, the driver needed to reset, which causes the virtual functions to be removed. All devices that are gone should be marked unhealthy. Normally this won't be the case, since the SR-IOV Network Operator will be managing the SR-IOV devices. However, DPUs/IPUs/SmartNICs with an embedded CPU would be managed externally by a separate operator. Both of these checks can be switched on and off using checkHealthOnPf and checkHealthOnDeviceExist within the resource config. Signed-off-by: William Zhao <[email protected]>
Signed-off-by: William Zhao <[email protected]>
e6bf4fd to 55719bb
Some operators use the Device Plugin with the SR-IOV Network Operator. I think some NVIDIA operators do this. For the DPU Operator, I plan to include this change and reuse the SR-IOV device plugin. DPUs and SmartNICs would benefit from this change, since they can be reset, rebooted, or shut down asynchronously. One thing I dislike is that Kubernetes doesn't put the pod into a failed state when the underlying allocated device becomes unhealthy. I think the expectation in k8s is to use liveness probes. I plan to dig a little more into this. I think the flags now included in the ConfigMap can be exposed through the SR-IOV Network Operator API after this change has been merged, perhaps in the SriovNetworkNodePolicy? Just wondering whether you think this would be useful for the SR-IOV Network Operator.
Pull Request Overview
This PR introduces a health check mechanism for net devices by evaluating both the physical function (PF) link status and the existence of PCI devices, and it updates several mock implementations and configuration structs to support this functionality.
- Renames and updates function calls in utils.go (deviceExist → DeviceExist)
- Enhances resource configuration and interfaces in types.go with health check flags and methods (CheckHealthOnPf and CheckHealthOnDeviceExist)
- Adds a new Probe function in netResourcePool.go and corresponding tests in netdevice/netResourcePool_test.go, along with updates in various mocks and device implementations
Reviewed Changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| pkg/utils/utils.go | Renamed function to public DeviceExist to reflect its intended API usage |
| pkg/types/types.go | Added new health check configuration flags and methods in device interfaces |
| pkg/types/mocks/* | Added mock implementations for GetHealth, SetHealth, and DeviceExists |
| pkg/resources/server.go | Reduced the check interval from 20 to 5 seconds |
| pkg/netdevice/netResourcePool.go | Implemented the Probe function to update device health based on PF and device existence checks |
| pkg/netdevice/netResourcePool_test.go | Added extensive tests for various health check scenarios |
| pkg/devices/gen_pci.go | Added DeviceExists method using the DeviceExist utility function |
| pkg/devices/gen_net.go | Added IsPfLinkUp method to inspect PF link status using new unix flags |
| pkg/devices/api.go | Added GetHealth and SetHealth methods for API devices |
| go.mod | Updated dependency on golang.org/x/sys |
| README.md | Updated documentation with the new health check configuration options |
Signed-off-by: William Zhao <[email protected]> Co-authored-by: Copilot <[email protected]>
@adrianchiris could you please take a look? This would be useful when the NVIDIA DPU is not managed by the DPF operator.
| "resourcePrefix" | N | Endpoint resource prefix name override. Should not contain special characters | string Default: "intel.com" | "yourcompany.com" |
| "deviceType" | N | Device Type for a resource pool. | string value of supported types. Default: "netDevice" | Currently supported values: "accelerator", "netDevice", "auxNetDevice" |
| "excludeTopology" | N | Exclude advertising of device's NUMA topology | bool Default: "false" | "excludeTopology": true |
| "checkHealthOnPf" | N | Check the health of a net device by inspecting the link state of the PF | bool Default: "false" | "checkHealthOnPf": true |
Should this be a list? That way we don't overburden the API, e.g.
healthChecks: ["DeviceNetworkLink", "DeviceExists"] or similar
I also want to add something like this:
healthChecks: ["DeviceNetworkLink", "DeviceExists"],
healthCheckInterval: 5,
Also, today we have no implementation of the probe; it was running a sleep loop that did nothing.
Let's change that to only start the probe goroutine if there is something in the healthChecks list.
 updateSignal: make(chan bool),
 stopWatcher: make(chan bool),
-checkIntervals: 20, // updates every 20 seconds
+checkIntervals: 5, // updates every 5 seconds
Might be too short for Telco environments.
The implementation checks:
1. If the physical function (PF) of the SR-IOV devices is carrier down. This
should be marked unhealthy. Normally, SR-IOV would still function when the
PF is carrier down. But in the case of DPUs/IPUs/SmartNICs with an embedded
CPU, the PF being down can signal that the embedded CPU is in reset or
shutdown with carrier down.
2. If any of the devices are gone. This could be due to someone changing
the number of virtual functions. Or, in the case of DPUs/IPUs/SmartNICs with
an embedded CPU, the driver needed to reset, which causes the virtual
functions to be removed. All devices that are gone should be marked
unhealthy. Normally this won't be the case, since the SR-IOV Network Operator
will be managing the SR-IOV devices. However, DPUs/IPUs/SmartNICs with
an embedded CPU would be managed externally by a separate operator.
Both of these checks can be switched on and off using checkHealthOnPf and
checkHealthOnDeviceExist within the resource config.
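To make the on/off behaviour concrete, here is a minimal sketch of how the two flags might gate the two checks. The JSON keys are the option names from the description; the `resourceHealthFlags` struct and `isHealthy` function are hypothetical and assume that a disabled check is treated as passing.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// resourceHealthFlags holds only the two options named in the description.
// The struct is a hypothetical stand-in for the real resource config type.
type resourceHealthFlags struct {
	CheckHealthOnPf          bool `json:"checkHealthOnPf"`
	CheckHealthOnDeviceExist bool `json:"checkHealthOnDeviceExist"`
}

// isHealthy applies only the enabled checks. A disabled check is treated as
// passing, so with both flags off every device is reported healthy.
func isHealthy(f resourceHealthFlags, pfLinkUp, deviceExists bool) bool {
	if f.CheckHealthOnPf && !pfLinkUp {
		return false
	}
	if f.CheckHealthOnDeviceExist && !deviceExists {
		return false
	}
	return true
}

func main() {
	var flags resourceHealthFlags
	raw := []byte(`{"checkHealthOnPf": true, "checkHealthOnDeviceExist": false}`)
	if err := json.Unmarshal(raw, &flags); err != nil {
		panic(err)
	}
	// PF carrier is down and the device is gone, but only the PF check is enabled.
	fmt.Println("healthy:", isHealthy(flags, false, false))
}
```

Because both options default to false, a configuration that sets neither keeps the earlier behaviour in this sketch: devices are never marked unhealthy by these checks.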