Add dependency on the GRID license for NVIDIA k8s device plugin #294
0e100c1 to a50ce29
^ I updated the approach to use timers and a marker file. I'm working on all the new logs from testing to confirm the behavior works as described.
bcressey left a comment:
For this:
A change will need to be added to the nvidia-k8s-device-plugin.service to depend on this file
Can you add this as a drop-in so there's no cross-kit dependency?
Add a unit that checks for the license to be valid for GRID. The NVIDIA k8s device plugin requires this unit so if the license is not present, then the node never offers gpu resources. This prevents a situation where a node could fail to get a license, join the cluster, and then later have workloads start to fail due to the unlicensed status. Signed-off-by: Matthew Yeazel <[email protected]>
a50ce29 to 0214fcb
^ Updated for the comments and added in the drop-in for nvidia-k8s-device-plugin.
bcressey left a comment:
Two minor fixes, LGTM otherwise.
0214fcb to 8f80e75
^ Fixed typos from comments.
@@ -0,0 +1,2 @@
[Service]
ExecStartPre=/usr/bin/test -f /etc/drivers/.grid-licensed
Why not ConditionPathExists?
https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#ConditionPathExists=
ConditionPathExists would skip the unit rather than retry it, and I want systemd to retry. If the unit happens to get scheduled before we have placed this file, the condition check would fail and the unit would not be started again on its own. The ExecStartPre matches the rest of the service's dependencies. See https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/packages/nvidia-k8s-device-plugin/nvidia-k8s-device-plugin.service for the other preconditions.
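The difference between the two approaches can be sketched as a drop-in. Only the ExecStartPre line comes from this PR. The Restart= and RestartSec= values are assumptions for illustration, since the device plugin's actual restart policy lives in the linked service file and is not reproduced here, and the drop-in file name is left unspecified.

```ini
# Hypothetical drop-in under nvidia-k8s-device-plugin.service.d/
[Service]
# From the PR: exits non-zero until the marker file exists, so the start
# attempt fails and the unit's restart policy (if any) applies.
ExecStartPre=/usr/bin/test -f /etc/drivers/.grid-licensed

# Assumed values, not taken from the PR: with a policy like this, systemd
# re-runs the start sequence (including ExecStartPre) after each failure,
# subject to the unit's start rate limits.
Restart=on-failure
RestartSec=5

# For comparison, a [Unit] condition is checked when the start job runs.
# If the path is missing, the unit is skipped without entering the failed
# state, so no restart policy is triggered:
# [Unit]
# ConditionPathExists=/etc/drivers/.grid-licensed
```

With this shape, a node whose marker file appears late still gets the device plugin once a retry finds the file, while a node that never obtains a license keeps failing the pre-check and never registers GPUs.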
Description of changes:
The GRID driver requires a license to fully function. The license is currently fetched on a best-effort basis, so a node can come up without a "Licensed" driver and fail later after some amount of time. This change adds a new marker file to confirm that the license has been correctly acquired or is not needed. It includes a drop-in for nvidia-k8s-device-plugin.service to depend on this file:
This will be before the ExecStart so that the device plugin will not register GPUs unless the license is not required or properly configured.
Testing done:
WIP - I'm updating the testing with the new approach for every use case but the primary use cases (GRID, open-gpu, and proprietary) all provide the correct experience.
Built this change and confirmed that it does not run on a G6 (which uses open-gpu, not GRID, so this isn't required), and that the node fails to become ready when nvidia-gridd is misconfigured to not start. The nodes become ready as expected when nvidia-gridd starts running normally.
Normal g6f.xlarge:
Broken nvidia-gridd results in the node becoming ready but in degraded systemd state and no GPUs offered:
Output from journal on a g6.2xlarge which uses the fallback:
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.