@yeazelm yeazelm commented Oct 14, 2025

Description of changes:
The GRID driver requires a license to fully function. Today the license is fetched on a best-effort basis, so a node can come up without a "Licensed" driver and fail some time later. This change adds a new marker file that confirms the license has been correctly acquired or is not needed. It also includes a drop-in for nvidia-k8s-device-plugin.service that depends on this file:

ExecStartPre=/usr/bin/test -f /etc/drivers/.grid-licensed

This runs before ExecStart, so the device plugin will not register GPUs unless the license is either properly configured or not required.
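
A minimal sketch of how this gate behaves, using a throwaway temp directory in place of /etc/drivers. The `MARKER` variable and `gate` function are illustrative names, not part of the change:

```shell
# Sketch only: emulate the ExecStartPre check against a temporary marker file.
# MARKER stands in for /etc/drivers/.grid-licensed; the real path is never touched.
MARKER="$(mktemp -d)/.grid-licensed"

gate() {
    # Same test the drop-in runs: succeeds only if the marker is a regular file.
    test -f "$MARKER"
}

# Before the marker exists, the gate fails and the service would not start.
if gate; then before="start"; else before="blocked"; fi

# Creating the marker is what a successful license check (or fallback) amounts to.
touch "$MARKER"

# With the marker in place, the gate passes.
if gate; then after="start"; else after="blocked"; fi

echo "before=$before after=$after"
```

The gate carries no state of its own: whichever unit creates the marker decides when the device plugin may start.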

Testing done:
WIP - I'm updating the test results for every use case with the new approach, but the primary use cases (GRID, open-gpu, and proprietary) all behave correctly.

Built this change and confirmed that the check does not run on a G6 (which uses open-gpu, not GRID, so it isn't required), that the node fails to become ready when nvidia-gridd is misconfigured so that it does not start, and that nodes become ready as expected when nvidia-gridd runs normally.

Normal g6f.xlarge

bash-5.1# journalctl -u grid-license-check.service
Dec 03 16:50:39 ip-192-168-52-46.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 16:50:39 ip-192-168-52-46.us-west-2.compute.internal systemd[1]: grid-license-check.service: Main process exited, code=exited, status=1/FAILURE
Dec 03 16:50:39 ip-192-168-52-46.us-west-2.compute.internal systemd[1]: grid-license-check.service: Failed with result 'exit-code'.
Dec 03 16:50:39 ip-192-168-52-46.us-west-2.compute.internal systemd[1]: Failed to start GRID License Check.
Dec 03 16:50:40 ip-192-168-52-46.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 16:50:41 ip-192-168-52-46.us-west-2.compute.internal systemd[1]: grid-license-check.service: Deactivated successfully.
Dec 03 16:50:41 ip-192-168-52-46.us-west-2.compute.internal systemd[1]: Finished GRID License Check.
bash-5.1# ls -al /etc/drivers/.grid-licensed
-rw-r--r--. 1 root root 0 Dec  3 16:50 /etc/drivers/.grid-licensed

A broken nvidia-gridd results in the node becoming ready, but in a degraded systemd state and with no GPUs offered:

bash-5.1# systemctl status
● ip-192-168-71-6.us-west-2.compute.internal
    State: degraded
    Units: 461 loaded (incl. loaded aliases)
     Jobs: 3 queued
   Failed: 1 units
    Since: Wed 2025-12-03 21:24:43 UTC; 46min ago
  systemd: 257.9
  Tainted: unmerged-bin
   CGroup: /


bash-5.1# journalctl -u grid-license-check.service
Dec 03 21:24:55 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 21:24:55 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Main process exited, code=exited, status=1/FAILURE
Dec 03 21:24:55 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Failed with result 'exit-code'.
Dec 03 21:24:55 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Failed to start GRID License Check.
Dec 03 21:24:56 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 21:24:56 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Main process exited, code=exited, status=1/FAILURE
Dec 03 21:24:56 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Failed with result 'exit-code'.
Dec 03 21:24:56 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Failed to start GRID License Check.
Dec 03 21:24:57 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 21:24:57 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Main process exited, code=exited, status=1/FAILURE
Dec 03 21:24:57 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Failed with result 'exit-code'.
Dec 03 21:24:57 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Failed to start GRID License Check.
Dec 03 21:24:59 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 21:25:00 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Main process exited, code=exited, status=1/FAILURE
Dec 03 21:25:00 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Failed with result 'exit-code'.
Dec 03 21:25:00 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Failed to start GRID License Check.
Dec 03 21:25:03 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 21:25:03 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Main process exited, code=exited, status=1/FAILURE
Dec 03 21:25:03 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Failed with result 'exit-code'.
Dec 03 21:25:03 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: Failed to start GRID License Check.
Dec 03 21:25:06 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Start request repeated too quickly.
Dec 03 21:25:06 ip-192-168-71-6.us-west-2.compute.internal systemd[1]: grid-license-check.service: Failed with result 'exit-code'.
.... 
# this continues for as long as the instance is up and no license exists

Output from journal on a g6.2xlarge which uses the fallback:

bash-5.1# journalctl -u grid-license-check.service
Dec 03 16:52:46 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 16:52:46 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: grid-license-check.service: Skipped due to 'exec-condition'.
Dec 03 16:52:46 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Condition check resulted in GRID License Check being skipped.
Dec 03 16:52:48 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Starting GRID License Check...
Dec 03 16:52:48 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: grid-license-check.service: Skipped due to 'exec-condition'.
Dec 03 16:52:48 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Condition check resulted in GRID License Check being skipped.
bash-5.1# journalctl -u open-gpu-license-fallback.service
Dec 03 16:52:47 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Starting Open GPU GRID License Check Fallback...
Dec 03 16:52:48 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Finished Open GPU GRID License Check Fallback.
bash-5.1# journalctl -u nvidia-k8s-device-plugin
Dec 03 16:52:48 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Starting Start NVIDIA kubernetes device plugin...
Dec 03 16:52:49 ip-192-168-84-108.us-west-2.compute.internal systemd[1]: Started Start NVIDIA kubernetes device plugin.
Dec 03 16:52:49 ip-192-168-84-108.us-west-2.compute.internal nvidia-device-plugin[3114]: I1203 16:52:49.175860    3114 main.go:235] "Starting NVIDIA Device Plugin" version="unknown"
Dec 03 16:52:49 ip-192-168-84-108.us-west-2.compute.internal nvidia-device-plugin[3114]: I1203 16:52:49.175889    3114 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
Dec 03 16:52:49 ip-192-168-84-108.us-west-2.compute.internal nvidia-device-plugin[3114]: I1203 16:52:49.175949    3114 main.go:245] Starting OS watcher.
Dec 03 16:52:49 ip-192-168-84-108.us-west-2.compute.internal nvidia-device-plugin[3114]: I1203 16:52:49.176276    3114 main.go:260] Starting Plugins.
Dec 03 16:52:49 ip-192-168-84-108.us-west-2.compute.internal nvidia-device-plugin[3114]: I1203 16:52:49.176288    3114 main.go:317] Loading configuration.
Dec 03 16:52:49 ip-192-168-84-108.us-west-2.compute.internal nvidia-device-plugin[3114]: I1203 16:52:49.177360    3114 main.go:342] Updating config with default resource matching patterns.
....

# The file exists to let the device plugin start
bash-5.1# ls -al /etc/drivers/.grid-licensed
-rw-r--r--. 1 root root 0 Dec  3 16:52 /etc/drivers/.grid-licensed

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@yeazelm yeazelm force-pushed the grid-license-check branch from 0e100c1 to a50ce29 Compare December 2, 2025 21:29
@yeazelm yeazelm changed the title Add dependency on the GRID license for kubelet Add dependency on the GRID license for NVIDIA k8s device plugin Dec 2, 2025

yeazelm commented Dec 2, 2025

^ I updated the approach to use timers and a marker file. I'm working on all the new logs from testing to confirm the behavior works as described.


@bcressey bcressey left a comment

For this:

A change will need to be added to the nvidia-k8s-device-plugin.service to depend on this file

Can you add this as a drop-in so there's no cross-kit dependency?

Add a unit that checks for the license to be valid for GRID. The NVIDIA
k8s device plugin requires this unit so if the license is not present,
then the node never offers gpu resources. This prevents a situation
where a node could fail to get a license, join the cluster, and then
later have workloads start to fail due to the unlicensed status.

Signed-off-by: Matthew Yeazel <[email protected]>

yeazelm commented Dec 11, 2025

^ Updated for the comments and added in the drop-in for nvidia-k8s-device-plugin.


@bcressey bcressey left a comment

Two minor fixes, LGTM otherwise.

yeazelm commented Dec 12, 2025

^ Fixed typos from comments.

@@ -0,0 +1,2 @@
[Service]
ExecStartPre=/usr/bin/test -f /etc/drivers/.grid-licensed

ConditionPathExists will fail the unit, and I want systemd to retry. If the unit happens to get scheduled before we have placed this file, it would just be marked as failed. The ExecStartPre matches the rest of the service's dependencies. See https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/packages/nvidia-k8s-device-plugin/nvidia-k8s-device-plugin.service for the other preconditions.
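
The retry behavior described here can be sketched as a plain shell loop. This is only an illustration: it assumes the unit is configured to restart on failure (its Restart= settings are not shown in this thread), and the attempt limit, the temp-directory marker, and the point at which the marker appears are all made up for the example:

```shell
# Sketch only: a failing pre-start check that is retried until the marker appears.
# MARKER stands in for /etc/drivers/.grid-licensed; MAX_ATTEMPTS is hypothetical.
MARKER="$(mktemp -d)/.grid-licensed"
MAX_ATTEMPTS=5

attempts=0
started="no"
while [ "$attempts" -lt "$MAX_ATTEMPTS" ]; do
    attempts=$((attempts + 1))
    if test -f "$MARKER"; then
        # The pre-start check passes, so the main process would be launched.
        started="yes"
        break
    fi
    # Simulate the license check finishing after the third failed attempt.
    if [ "$attempts" -eq 3 ]; then
        touch "$MARKER"
    fi
done

echo "started=$started attempts=$attempts"
```

A check evaluated once and never retried would have given up at the first attempt, which is the case the comment above wants to avoid.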

@yeazelm yeazelm merged commit bbc3e49 into bottlerocket-os:develop Dec 12, 2025
2 checks passed