Skip to content

Conversation

@wizhaoredhat
Copy link

@wizhaoredhat wizhaoredhat commented Apr 9, 2025

The implementation checks:

  1. If the physical function PF of the SR-IOV devices is carrier down. This
    should be marked unhealthy. Normally, SR-IOV would still function when the
    PF is carrier down. But in the case of DPUs/IPUs/SmartNics with an embedded
    CPU, the PF being down can signal that the embedded CPU is in reset or
    shutdown with carrier down.
  2. If any of the devices are gone. This could be due to someone changing
    the number of virtual functions. Or in the case of DPUs/IPUs/SmartNics with
    an embedded CPU, the driver needed to reset. This will cause the virtual
    functions to be removed. All devices that are gone should be marked
    unhealthy. Normally this won't be the case since the SR-IOV Network Operator
    will be managing the SR-IOV devices. However for DPUs/IPUs/SmartNics with
    an embedded CPU, would be externally managed with a separate operator.

Both these can be switched on and off using checkHealthOnPf and
checkHealthOnDeviceExist within the resource config.

@wizhaoredhat
Copy link
Author

@zeeke / @SchSeba PTAL

@wizhaoredhat wizhaoredhat force-pushed the add_health_check branch 3 times, most recently from 4ba69ac to 6a5c22c Compare April 9, 2025 00:40
@coveralls
Copy link
Collaborator

coveralls commented Apr 9, 2025

Pull Request Test Coverage Report for Build 15006985984

Details

  • 51 of 75 (68.0%) changed or added relevant lines in 6 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.2%) to 74.306%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/devices/gen_pci.go 0 3 0.0%
pkg/devices/gen_net.go 0 6 0.0%
pkg/netdevice/netResourcePool.go 47 53 88.68%
pkg/devices/api.go 0 9 0.0%
Totals Coverage Status
Change from base Build 14837814551: -0.2%
Covered Lines: 2169
Relevant Lines: 2919

💛 - Coveralls

Copy link
Member

@zeeke zeeke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few minor comments

@wizhaoredhat
Copy link
Author

@zeeke Please take a look at the changes again.

@wizhaoredhat
Copy link
Author

@adrianchiris PTAL

@SchSeba
Copy link
Collaborator

SchSeba commented Apr 23, 2025

Hi @wizhaoredhat,
please check the CI failed

Copy link
Collaborator

@SchSeba SchSeba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good check two small comments.

a following question is where are we going to use this feature? it needs to be part of the sriov-network-operator or a different operator that deploys a device plugin?

}
}

if pfIsUp && deviceExists && !currentHealth {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think we can make this long if else statement simpler? took me some time to get it

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look now at the reduce if statement. Let me know if it is better.

Generated updated mocks with "make generate-mocks"

Signed-off-by: William Zhao <[email protected]>
The APIs will be used to determine if a device is healthy or not.

Generated updated mocks with "make generate-mocks"

Signed-off-by: William Zhao <[email protected]>
The implementation checks:
1. If the physical function PF of the SR-IOV devices is carrier down. This
should be marked unhealthy. Normally, SR-IOV would still function when the
PF is carrier down. But in the case of DPUs/IPUs/SmartNics with an embedded
CPU, the PF being down **can** signal that the embedded CPU is in reset or
shutdown with carrier down.
2. If any of the devices are gone. This could be due to someone changing
the number of virtual functions. Or in the case of DPUs/IPUs/SmartNics with
an embedded CPU, the driver needed to reset. This will cause the virtual
functions to be removed. All devices that are gone should be marked
unhealthy. Normally this won't be the case since the SR-IOV Network Operator
will be managing the SR-IOV devices. However for DPUs/IPUs/SmartNics with
an embedded CPU, would be externally managed with a separate operator.

Both these can be switched on and off using checkHealthOnPf and
checkHealthOnDeviceExist within the resource config.

Signed-off-by: William Zhao <[email protected]>
@wizhaoredhat
Copy link
Author

overall looks good check two small comments.

a following question is where are we going to use this feature? it needs to be part of the sriov-network-operator or a different operator that deploys a device plugin?

Some operators use the Device Plugin with the SR-IOV Network Operator. I think some NVIDIA operators do this. But for the DPU Operator, I plan to include this change and use reuse the SR-IOV device plugin. DPUs and Smart NICs would benefit with this change since DPUs or SmartNICs can be reset/reboot/shutdown asynchronously.

One thing I dislike is the fact that kubernetes doesn't put the pod into a failed state when the underlying resource allocation device becomes unhealthy. I think the expectation of k8s is to use liveliness probes. I plan to dig a little bit more on this.

I think these flags included now in the Config Map can be exposed to the Sriov Network Operator API after this change has been merged. I think the SriovNetworkNodePolicy? Just wondering if you think this would be useful for the SR-IOV Network Operator or not.

@SchSeba SchSeba requested a review from Copilot May 27, 2025 13:10
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a health check mechanism for net devices by evaluating both the physical function (PF) link status and the existence of PCI devices, and it updates several mock implementations and configuration structs to support this functionality.

  • Renames and updates function calls in utils.go (deviceExist → DeviceExist)
  • Enhances resource configuration and interfaces in types.go with health check flags and methods (CheckHealthOnPf and CheckHealthOnDeviceExist)
  • Adds a new Probe function in netResourcePool.go and corresponding tests in netdevice/netResourcePool_test.go, along with updates in various mocks and device implementations

Reviewed Changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pkg/utils/utils.go Renamed function to public DeviceExist to reflect its intended API usage
pkg/types/types.go Added new health check configuration flags and methods in device interfaces
pkg/types/mocks/* Added mock implementations for GetHealth, SetHealth, and DeviceExists
pkg/resources/server.go Reduced the check interval from 20 to 5 seconds
pkg/netdevice/netResourcePool.go Implemented the Probe function to update device health based on PF and device existence checks
pkg/netdevice/netResourcePool_test.go Added extensive tests for various health check scenarios
pkg/devices/gen_pci.go Added DeviceExists method using the DeviceExist utility function
pkg/devices/gen_net.go Added IsPfLinkUp method to inspect PF link status using new unix flags
pkg/devices/api.go Added GetHealth and SetHealth methods for API devices
go.mod Updated dependency on golang.org/x/sys
README.md Updated documentation with the new health check configuration options

@wizhaoredhat
Copy link
Author

@adrianchiris could you please take a look? This would be useful when the NVIDIA DPU is not managed by the DPF operator.

| "resourcePrefix" | N | Endpoint resource prefix name override. Should not contain special characters | string Default : "intel.com" | "yourcompany.com" |
| "deviceType" | N | Device Type for a resource pool. | string value of supported types. Default: "netDevice" | Currently supported values: "accelerator", "netDevice", "auxNetDevice" |
| "excludeTopology" | N | Exclude advertising of device's NUMA topology | bool Default: "false" | "excludeTopology": true |
| "checkHealthOnPf" | N | Check the health of a net device by inspecting the link state of the PF | bool Default: "false" | "checkHealthOnPf": true |
Copy link
Contributor

@adrianchiris adrianchiris Jun 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be a list ? that way we dont overcumber the API

e.g
healthChecks: ["DeviceNetworkLink", "DeviceExists"] or similar

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to add also something like this

healthChecks: ["DeviceNetworkLink", "DeviceExists"],
healthCheckInterval: 5,

Also today we have no implementation of probe it was running a sleep loop doing nothing.
let's change that to only start the probe go routine if there is something in the healthChecks list

updateSignal: make(chan bool),
stopWatcher: make(chan bool),
checkIntervals: 20, // updates every 20 seconds
checkIntervals: 5, // updates every 5 seconds
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Light be too short for Telco environments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants