-
Notifications
You must be signed in to change notification settings - Fork 94
Add GPU health check #689
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add GPU health check #689
Changes from 2 commits
0e5dd5e
a7c85c7
896ef3a
599fb15
ae7211e
2d7618d
1a950ca
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
| @@ -0,0 +1,281 @@ | ||||||
| /* | ||||||
| * Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved. | ||||||
guptaNswati marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
| * | ||||||
| * Licensed under the Apache License, Version 2.0 (the "License"); | ||||||
| * you may not use this file except in compliance with the License. | ||||||
| * You may obtain a copy of the License at | ||||||
| * | ||||||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||||||
| * | ||||||
| * Unless required by applicable law or agreed to in writing, software | ||||||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||||||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||||||
| * See the License for the specific language governing permissions and | ||||||
| * limitations under the License. | ||||||
| */ | ||||||
|
|
||||||
| package main | ||||||
|
|
||||||
| import ( | ||||||
| "context" | ||||||
| "fmt" | ||||||
| "strconv" | ||||||
| "strings" | ||||||
| "sync" | ||||||
|
|
||||||
| "github.com/NVIDIA/go-nvml/pkg/nvml" | ||||||
| "k8s.io/klog/v2" | ||||||
| ) | ||||||
|
|
||||||
| const ( | ||||||
| FullGPUInstanceID uint32 = 0xFFFFFFFF | ||||||
| ) | ||||||
|
|
||||||
| type deviceHealthMonitor struct { | ||||||
|
||||||
| type deviceHealthMonitor struct { | |
| type nvmlDeviceHealthMonitor struct { |
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't just create the health monitor, it also starts it. Why not separate construction from actually starting the monitor? This will allow us to more easily swap out monitors (e.g. with NVSentinel) once we have those ready.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QUestion: What functionality is required for a DeviceHealthMonitor from the perspective of the driver. Would returning the following interface here make sense:
type DeviceHealthMonitor interface {
Start() error
Stop()
Unhealthy() <-chan *AllocatableDevice
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes an interface will allow more flexibility in case we want to integrate with other tools and for testing also but right now we only have NVML based implementation. A struct is more simpler and direct. Once we have the generic health API (which will be soon), all this will go away anyway as driver will not be responsible for starting or stopping the monitor, rather it will either talk to a grpc or http endpoint.
same logic to constructing and starting. Though for readability it may be better to separate the constructor and start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Does allocatable change at all, or is it constant for the lifetime of the plugin?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do update the health status of an allocatable when we get a unhealthy notification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that wasn't clear. Does the content of allocatable change in any way that would invalidate the map that we construct here meaning that it would need to be reconstructed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the only changing content of an allocatable device is its health status and that wont impact it. This map ([uuid] = device) is constructed in the very beginning when the plugin is started.
and we iterate on it to register event on each device (currently, we dont check any status here. we may do it in future on remediation) and send unhealthy notification.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the device plugin implementation setting skippedIDs to all disables health checking. Why is this logic different here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
health checking is feature-gated here, so it can be skipped without this flag. I should add a comment to make it explicit
guptaNswati marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
guptaNswati marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is probably like this in the device plugin too, but should we at least log an error here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I needed to look this up to understand what it's doing.
Ref docs are here: https://docs.nvidia.com/deploy/nvml-api/group__nvmlEvents.html#group__nvmlEvents_1g9714b0ca9a34c7a7780f87fee16b205c
argument is timeout in ms.
errors to be handled:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not seem right.
Let us specifically handle the NVML_ERROR_GPU_IS_LOST case, and perform this 'nuke option' only then. Then we can also have a more precise log message (emitted on error level).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why only NVML_ERROR_GPU_IS_LOST If we are not just checking !NVML_SUCCESS, TIMEOUT and UNKNOWN also seems imp.
If you see the returns of all event methods, these all seem common error types. I can do a helper of errortype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something like https://github.com/NVIDIA/go-nvml/blob/main/pkg/nvml/return.go#L42
but then we mark the device unhealthy in each case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
while this will allow for proper error handling, this will also deviate from how its done in device-plugin. i can take it up as a follow-up to fix in both.
@jgehrcke thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pinging all the reviewers to help resolve this. Most other seems to be non-blocking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this will also deviate from how its done in device-plugin.
Let's get this PR into the state we want it for the DRA driver without trying to match the device plugin implementation exactly. That is to say, consider this the next iteration of the device plugin implementation. The learnings that we take from this should be applied to the device plugin and iterated on from there.
One motivation is that this is a new feature and we don't have users currently expecting a certain behaviour. This gives us a lot more flexibility to change behaviour than if we had an existing implementation in use.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is implementation detail of how we want to handle NVML return errors so it wont impact the end user anyway whether we do it similar to device-plugin or improve it here. For user, all these error means unhealthy device and wont be published as part of ResourceSlice. And IMO, ret != nvml.SUCCESS is a valid check and covers a wide range of errors as all subsequent calls are dependent on this check.
For me, these make sense when we have proper remediation in place where based on the given error, the right action can be recommended.
guptaNswati marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
shivamerla marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
guptaNswati marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(also for the device plugin) Can we track the follow-up action of checking why we don't check other supported types? Do whe have any indication of whether we ever see the log message below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have an example for this log message, how it would look like in practice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I1027 18:03:39.337167 1 device_health.go:179] Processing event {Device:{Handle:0xe151b6b2fef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems bit aggressive to mark all devices as unhealthy on one invalid event. Should we log this as error and continue watch? cc @klueska
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its how its done in device-plugin https://github.com/NVIDIA/k8s-device-plugin/blob/main/internal/rm/health.go#L147
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd also say we should log an error and otherwise proceed. Even if what you've shown here is currently done in the device plugin.
By the way, this would have been a perfect opportunity for a better code comment in the legacy code:
No blame, no emotions -- but this code comment does not add information in addition to the code. The interesting bit would be if there is a specific, non-obvious reason / relevance for this style of treatment.
For example, I wonder if this code was introduced to fix a bug. I wonder if it is even ever exercised.
The way it's written and with the git blame history, it seems like it was potentially added initially (defensively) and may never have been exercised in production.
shivamerla marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
shivamerla marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
guptaNswati marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
guptaNswati marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
guptaNswati marked this conversation as resolved.
Show resolved
Hide resolved
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -550,6 +550,14 @@ func GetOpaqueDeviceConfigs( | |||||
| return resultConfigs, nil | ||||||
| } | ||||||
|
|
||||||
| func (s *DeviceState) UpdateDeviceHealthStatus(device *AllocatableDevice, healthstatus string) { | ||||||
| s.Lock() | ||||||
| defer s.Unlock() | ||||||
|
|
||||||
| device.Health = healthstatus | ||||||
| klog.Infof("Update device sattus:%s healthstatus", device.UUID()) | ||||||
|
||||||
| klog.Infof("Update device sattus:%s healthstatus", device.UUID()) | |
| klog.Infof("Update device status:%s healthstatus", device.UUID()) |
Also, did you mean to output the actual status?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will fix typo. no just the device.uuid
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typos, spaces etc.
Can you please explain / give an impression of how often this message would be logged? What actions/events will trigger this message to be logged?
When we understand that, let's have a brief think about a suitable verbosity level.
One of the questions that I have here: is this only logged for a health flip? Or could this also be logged for healthy->healthy? Should we have a different message/level depending on the state transition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right now, its only logged when healthy becomes unhealthy. But in future when we have remediation, it will also change to unhealthy->healthy.
Uh oh!
There was an error while loading. Please reload this page.