GPU health check #545
Conversation
Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
current log: resourceclaim status update is still broken.
Pull Request Overview
Adds preliminary GPU health monitoring functionality to detect and handle unhealthy GPU devices in the NVIDIA DRA driver. The implementation listens for NVML events (XID errors, ECC errors) and removes unhealthy devices from the allocatable pool.
- Introduces device health status tracking with
Healthy/Unhealthystates - Implements NVML event-based health monitoring for GPU devices
- Updates resource claim status to reflect device health conditions
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| cmd/gpu-kubelet-plugin/nvlib.go | Initialize all devices with Healthy status |
| cmd/gpu-kubelet-plugin/driver.go | Add device health monitor initialization and health notification handling |
| cmd/gpu-kubelet-plugin/device_state.go | Add device health status updates and resource claim status reporting |
| cmd/gpu-kubelet-plugin/device_health.go | New file implementing NVML event-based health monitoring |
| cmd/gpu-kubelet-plugin/allocatable.go | Add health status field and methods to AllocatableDevice |
cmd/gpu-kubelet-plugin/driver.go
Outdated
```go
if err != nil {
	return nil, fmt.Errorf("start deviceHealthMonitor: %w", err)
}
klog.Info("[SWATI DEBUGS] Started device health monitor")
```
Copilot AI — Sep 8, 2025
There's a typo in the log message: 'DEBUGS' should be 'DEBUG' to match the pattern used in other debug messages.
```diff
- klog.Info("[SWATI DEBUGS] Started device health monitor")
+ klog.Info("[SWATI DEBUG] Started device health monitor")
```
cmd/gpu-kubelet-plugin/driver.go
Outdated
```go
var resourceSlice resourceslice.Slice
for _, dev := range d.state.allocatable {
	if dev.IsHealthy() {
		klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
```
Copilot AI — Sep 8, 2025
There's a typo in the log message: 'resoureslice' should be 'resourceslice'.
```diff
- klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
+ klog.Infof("[SWATI DEBUG] device is healthy, added to resourceslice: %v", dev)
```
cmd/gpu-kubelet-plugin/driver.go
Outdated
```go
}

// Republish updated resources
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
```
Copilot AI — Sep 8, 2025
There's a typo in the log message: 'rebulishing' should be 'republishing'.
```diff
- klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
+ klog.Info("[SWATI DEBUG] republishing resourceslice with healthy devices")
```
```go
	Config: configapi.DefaultMigDeviceConfig(),
})

// Swati: Add resourceclaim status update
```
Copilot AI — Sep 8, 2025
The comment should follow proper Go comment conventions and be more descriptive. Consider: '// Add resource claim status update to track device health'.
```diff
- // Swati: Add resourceclaim status update
+ // Add resource claim status update to track device health.
```
```go
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")
```
Copilot AI — Sep 8, 2025
The comment should follow proper Go comment conventions. Consider: '// Add health status to device allocation result'.
```diff
- // Swati add health check
- klog.Info("[SWATI DEBUG] adding device status")
+ // Add health status to device allocation result
```
```go
//defer nvdevlib.alwaysShutdown()

//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
//	return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}
```
Copilot AI — Sep 8, 2025
Commented-out code should be removed. If this code might be needed later, consider documenting why it's commented out or remove it entirely.
```diff
- //defer nvdevlib.alwaysShutdown()
- //klog.Info("[SWATI DEBUG] getting all devices..")
- //allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
- //if err != nil {
- //	return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
- //}
```
```go
}

func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) {
	klog.Info("[SWATI DEBUG] initializing NVML..")
```
Copilot AI — Sep 8, 2025
The log message has inconsistent punctuation. Either use 'NVML...' (with proper ellipsis) or 'NVML' (without trailing dots).
```diff
- klog.Info("[SWATI DEBUG] initializing NVML..")
+ klog.Info("[SWATI DEBUG] initializing NVML")
```
More logs after fixing the republish of the resourceslice when an unhealthy GPU is found.
Need to fix the resourceclaim status update: it is not using the right client API.
Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed, but a device taint should be added for it so the scheduler doesn't schedule it.
Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far, as we haven't had a need to add a status yet.
Yes, this is just to test the e2e flow (reporting any health events; an example action is republishing the slice from the driver). This is just to see if I have set everything up correctly.
Not resourceslice, but update the resourceclaim status similar to this https://github.com/google/dranet/pull/78/files#diff-e8a7e777d80a14b455bdbf7aae3f28ad8082ffa0a06579e11cc1af741b5f98f7R266
Got the resourceclaim status to be updated.
Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
Force-pushed 717656d to d1852f0
Signed-off-by: Swati Gupta <[email protected]>
Updated action on health event: update the device condition to unhealthy in the resourceclaim status.
@ArangoGutierrez @klueska can I get a prelim review on this? There are still some tasks remaining, but it's in a working state.
Test of skipped XID:
Signed-off-by: Swati Gupta <[email protected]>
```go
func (s *DeviceState) MarkDeviceUnhealthy(device *AllocatableDevice) {
	// SWATI: check if a mig device is marked properly
	s.Lock()
	defer s.Unlock()

	device.Health = Unhealthy
	klog.Infof("Marked device:%s unhealthy", device.GetUUID())
}
```
Can this take health as a parameter so it can be reused once we have the ability to bring a device back to healthy?
```go
SearchDeviceGroups:
	for _, group := range preparedClaim.PreparedDevices {
		for _, device := range group.Devices {
			var currentUUID string
			var currentDeviceName string

			if device.Gpu != nil {
				currentUUID = device.Gpu.Info.UUID
				currentDeviceName = device.Gpu.Device.DeviceName
			} else if device.Mig != nil {
				currentUUID = device.Mig.Info.UUID
				currentDeviceName = device.Mig.Device.DeviceName
			}

			if currentUUID == unhealthyDeviceUUID {
				klog.V(6).Infof("found matching device: %v for claim: %s", currentDeviceName, claimUID)
				matchingDeviceName = currentDeviceName
				break SearchDeviceGroups
			}
		}
	}
```
You can avoid the labeled break by putting this in a function and returning at the point that you find what you are looking for.
```go
}

func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) {
	claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{})
```
This needs to work not just for the v1beta1 api, but all of v1, v1beta1, and v1beta2. We have helpers to do that in the staging repo.
Yes, I was not sure how to do it. Link?
```go
}

func (d *driver) findClaimByUID(ctx context.Context, claimUID types.UID) (*v1beta1.ResourceClaim, error) {
	claimList, err := d.state.config.clientsets.Core.ResourceV1beta1().ResourceClaims("").List(ctx, metav1.ListOptions{})
```
I'm not sure how I feel about listing all Resource claims and then searching through them for the matching UID. It feels like we should rather be storing the claim name / namespace in the checkpoint so we can pull it directly and then assert that it has the correct UID.
Yes, this is not efficient; I didn't want to make changes to the existing code since we are not sure whether we want to take this action. This is more of a sample action.
As discussed offline, we may update the checkpoint to add the claim name/namespace for faster lookup when updating the claim status of an unhealthy device, so that there is no need to iterate over all claims to find the needed one.
For another use case, it's already getting updated for computedomain.
```go
if err := d.state.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, ds); err != nil {
	klog.Errorf("Failed to update status for claim %s/%s: %v", claim.Namespace, claim.Name, err)
} else {
	klog.V(6).Infof("applied unhealthy device status to claim %s/%s", claim.Namespace, claim.Name)
}
```
I don't think we should just give up here. It's likely we may have a conflict when writing and need to try again. Instead we should push a task to the workqueue that will keep retrying until it succeeds.
```go
claim, err := d.findClaimByUID(ctx, types.UID(claimUID))
if err != nil {
	klog.Errorf("Failed to find ResourceClaim object for UID %s: %v", claimUID, err)
```
This isn't necessarily an error.
```go
// Update allocated device health status in a given claim
result, claimUID, err := d.findDeviceResultAndClaimUID(uuid)
if err != nil {
	klog.Errorf("Device %s is unhealthy, but no associated claim was found: %v", uuid, err)
```
This is not necessarily an error.
```go
klog.V(6).Infof("Adding devices health status to claim %s/%s", claim.Namespace, claim.Name)
if err := s.applyClaimDeviceStatuses(ctx, claim.Namespace, claim.Name, deviceStatuses...); err != nil {
	klog.Warningf("Failed to update devices status for claim %s/%s: %v", claim.Namespace, claim.Name, err)
}
```
High level question -- who is going to be reading this status from the ResourceClaim and doing anything with it? I know I suggested looking at DRANet and seeing how they were reporting their health status, but does it even make sense to do this in our driver? Why does DRANet need to do it?
/cc @aojea
DRANet uses the standardized fields for network information in the status, as does cni-dra-driver.
Using standard data for reporting this information in the status allows us to build tooling and applications on top, which is especially useful for some multi-networking or monitoring use cases.
As discussed in the team meeting, we need to bring this back for the initial release of these health checks. The "right" way to mark the GPUs as unhealthy will be with device taints, but those are still an alpha feature, and we need some way to mark them as unhealthy/unschedulable in the interim.
```go
d.state.MarkDeviceUnhealthy(device)

// Update allocated device health status in a given claim
result, claimUID, err := d.findDeviceResultAndClaimUID(uuid)
```
Does d.findDeviceResultAndClaimUID(uuid) only operate on local file system state (checkpoint data)? Or is any networking interaction or IPC involved?
I would find it helpful to clarify that in the code comment right above that call, or to even reflect that in the method name.
Local checkpoint data.
```go
	continue
}

claim, err := d.findClaimByUID(ctx, types.UID(claimUID))
```
Could we add a code comment here explaining why at this point it is relevant to look up the full claim object?
That code comment would clarify the importance of having that data at all, and having it fresh. It would also clarify
- how much of a problem it is if we do not find the claim
- how much of a problem it is to use an outdated version of the claim object
That's so important to have clarified clearly, and maybe we want to discuss that goal statement (specification) before getting too deep into the weeds of the error handling / retrying behavior implementation review.
For this to work, we also need reconciliation to bring devices back once there is remediation and the GPU is healthy again.
Quick MIG test logs:
Signed-off-by: Swati Gupta <[email protected]>
Closing this in favor of #689
Addressing #360 to add a preliminary health check.