Support both CDI and Legacy NVIDIA Container Runtime modes #459


Merged

Conversation


@sky1122 sky1122 commented Apr 8, 2025

Issue number:

Closes #468

Description of changes:

  • Set DriverRoot and DefaultContainerDriverRoot in nvidia-k8s-device-plugin
  • Select the correct mode for the default NVIDIA container runtime, based on the device-list-strategy configured for the k8s device plugin (see the sketch below)
  • Disable mode detection in the prestart hook, as it prevents the NVIDIA legacy runtime from working when the mode for the default NVIDIA runtime is set to cdi
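
For reference, a minimal sketch of how the CDI mode could be selected through the Bottlerocket API settings; the setting path is taken from the template in this PR, while the user-data framing and the exact value shape are assumptions:

```toml
# Hypothetical user-data fragment (sketch, not verified against the final
# settings model): a device-list-strategy whose first entry is "cdi-cri"
# selects mode="cdi" for the default NVIDIA container runtime.
[settings.kubelet-device-plugins.nvidia]
device-list-strategy = ["cdi-cri"]
```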

Testing done:
In combination with:

In variants: aws-k8s-1.29, aws-k8s-1.31, aws-k8s-1.32, aws-k8s-1.33 for x86_64 and aarch64:

  • Verified that all the containerd runtimes are functional
  • Verified that the correct mode is configured depending on the device-list-strategy values:
| device-list-strategy (API values) | Mode | Status |
|---|---|---|
| envvar | legacy | Pass |
| envvar, volume-mounts | legacy | Pass |
| envvar, volume-mounts, cdi-cri | legacy | Pass |
| envvar, cdi-cri | legacy | Pass |
| envvar, cdi-cri, volume-mounts | legacy | Pass |
| volume-mounts | legacy | Pass |
| volume-mounts, envvar | legacy | Pass |
| volume-mounts, envvar, cdi-cri | legacy | Pass |
| volume-mounts, cdi-cri | legacy | Pass |
| volume-mounts, cdi-cri, envvar | legacy | Pass |
| cdi-cri | cdi | Pass |
| cdi-cri, volume-mounts | cdi | Pass |
| cdi-cri, volume-mounts, envvar | cdi | Pass |
| cdi-cri, envvar | cdi | Pass |
| cdi-cri, envvar, volume-mounts | cdi | Pass |

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch from 6f2376d to e67fc04 on April 8, 2025 21:30

sky1122 commented Apr 8, 2025

Force-pushed to drop unrelated commits.

@sky1122 sky1122 requested a review from arnaldo2792 April 8, 2025 21:34
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch 2 times, most recently from 2deed72 to f6daa5d on April 9, 2025 23:55

sky1122 commented Apr 9, 2025

Force-pushed to change the commit content and add a subject to the patch.

@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch from f6daa5d to 74d6c03 on April 11, 2025 18:25

sky1122 commented Apr 11, 2025

Force-pushed to add the sign-off in one commit.


sky1122 commented Apr 11, 2025

Force-pushed to add all the changes from #467 to this PR.

@sky1122 sky1122 changed the title from "Packages: enable cdi in k8s-device-plugin and containerd" to "Packages: add support for CDI in k8s" Apr 14, 2025
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch from 3325142 to aab387b on April 16, 2025 16:59

sky1122 commented Apr 16, 2025

Force-pushed to fix the issue raised in the conversation.

@sky1122 sky1122 requested a review from arnaldo2792 April 16, 2025 18:45
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch from aab387b to 6f6c507 on April 18, 2025 17:32

sky1122 commented Apr 18, 2025

Force-pushed to address the conversation above.

@sky1122 sky1122 requested a review from arnaldo2792 April 18, 2025 18:00
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch 2 times, most recently from 25ea570 to af3be11 on April 21, 2025 19:48
@sky1122 sky1122 added and then removed the bug ("Something isn't working") label Apr 21, 2025
@sky1122 sky1122 marked this pull request as ready for review April 21, 2025 21:23
@sky1122 sky1122 requested review from bcressey, ytsssun and koooosh April 21, 2025 21:23
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch from af3be11 to ee2f138 on April 24, 2025 00:01

sky1122 commented Apr 24, 2025

Force-pushed to add a new patch to nvidia-k8s-device-plugin.

@sky1122 sky1122 removed the request for review from koooosh April 24, 2025 00:06
@sky1122 sky1122 requested a review from KCSesh April 24, 2025 00:06
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch from ee2f138 to 1145096 on April 24, 2025 00:07

sky1122 commented Apr 24, 2025

Force-pushed to change one commit message.


[nvidia-container-runtime]
{{#if settings.kubelet-device-plugins.nvidia.device-list-strategy}}
{{#if (eq settings.kubelet-device-plugins.nvidia.device-list-strategy "cdi-cri")}}
@elezar commented:

Setting mode="cdi" here is correct, but possibly for the wrong reasons.

Note that if the mode is cdi-cri, the nvidia-container-runtime will not perform the injection since it cannot read the CRI fields set by the plugin. This means that the nvidia-container-runtime performs no modification to the OCI runtime specification and forwards the command to runc.

In the case where cdi-cri is used, one does not strictly need the nvidia runtime at all. In the case of the GPU Operator we "solve" this by (see the sketch after this list):

  1. Not setting the nvidia runtime as the default in containerd.
  2. Adding a RuntimeClass for the nvidia runtime.
  3. Ensuring that "management" containers such as the k8s-device-plugin that themselves need access to GPUs use this runtime class.
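
For illustration, registering a non-default nvidia runtime in containerd typically looks like the sketch below; the plugin IDs and binary path are the common defaults, assumed here rather than taken from this PR:

```toml
# Sketch: register an "nvidia" runtime in containerd without making it the
# default. Pods opt in through a RuntimeClass whose handler is "nvidia".
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```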

@arnaldo2792 (Contributor) commented Apr 24, 2025

Hey @elezar, thanks again for the interest in this PR!

> Note that if the mode is cdi-cri, the nvidia-container-runtime will not perform the injection since it cannot read the CRI fields set by the plugin. This means that the nvidia-container-runtime performs no modification to the OCI runtime specification and forwards the command to runc.

We use cdi-cri to let the Kubernetes Device Plugin generate the CDI specifications, and we enabled the CDI support in containerd to perform the CDI updates. So nvidia-container-runtime effectively does nothing since all the updates were already applied by containerd.

The reason why we used nvidia-container-runtime at all is to allow users to use NVIDIA_VISIBLE_DEVICES since, even though we understand that this environment variable is meant to be used solely for "management" containers, we have seen Bottlerocket users rely on this "feature" to allow over-subscription of the GPUs. As I said elsewhere, we do advise against this, but we also don't want to break users that heavily rely on this type of oversubscription.

What we want to achieve with the setup that we have is a "hands-free" experience, where the users just create their pod specifications without having to worry about the RuntimeClass they have to use as we have it today with the legacy support. Additionally, we wanted to stop injecting the prestart hook even from management containers to fully rely on CDI.

Please let us know if you think there are problems with this approach. FWIW, we don't run the k8s device plugin as a daemonset, we run it as a service on the host, but users are responsible for running other "management" containers like dcgm-exporter.

@arnaldo2792 (Contributor) commented:
Just to circle back to this discussion, we decided to include the additional runtimes to be more compatible with the GPU Operator. Thanks for the feedback @elezar!

The NVIDIA Device Plugin looks for libraries and binaries at `/driver-root`
because it expects to be running as a container where the driver root is
mounted at that location.

In Bottlerocket, the driver root is the actual root filesystem, so use that instead of the default value.

Signed-off-by: Jingwei Wang <[email protected]>
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch 2 times, most recently from e4af0a8 to 071452d on May 14, 2025 23:21
@arnaldo2792 arnaldo2792 marked this pull request as draft May 15, 2025 23:56
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch 2 times, most recently from e06a3ce to 91230e2 on May 16, 2025 17:08
@sky1122
Copy link
Contributor Author

sky1122 commented May 16, 2025

Force-pushed to change the commit format and comments to read better.

Update the NVIDIA Container Toolkit configuration template to dynamically
select the mode based on the configuration used in the Kubernetes Device
Plugin.

Signed-off-by: Jingwei Wang <[email protected]>
@sky1122 sky1122 force-pushed the enable-cdi-k8s-containerd branch from 91230e2 to 98c5634 on May 16, 2025 20:24
@sky1122
Copy link
Contributor Author

sky1122 commented May 16, 2025

Force-pushed to fix the handlebars template.

The 'default' handlebars helper doesn't work with arrays. Adjust how the
device-list-strategy property is set depending on the type used in the
API settings. Default to 'volume-mounts' when the setting is missing.

Signed-off-by: Jingwei Wang <[email protected]>
Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
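
A minimal sketch of the type-based branching this commit describes; the exact structure is assumed (the merged template appears in the review excerpts below), and it relies on schnauzer's handlebars resolving a missing index to a falsy value:

```handlebars
{{!-- Sketch: indexing into the setting distinguishes the array form from
      the scalar form; for a scalar or unset value, .[0] resolves to
      nothing and the else branch is taken. --}}
{{#if settings.kubelet-device-plugins.nvidia.device-list-strategy.[0]}}
{{!-- array form: the first element decides the mode --}}
{{else}}
{{!-- scalar string form (or unset): compare the value directly --}}
{{/if}}
```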
@arnaldo2792 arnaldo2792 force-pushed the enable-cdi-k8s-containerd branch from 98c5634 to 972c58c on May 19, 2025 14:58
The prestart hook attempts to detect the mode used by the NVIDIA
container runtime, and fails if the detected mode is different than
'legacy'. This breaks the NVIDIA legacy runtime when the default mode is
set to 'cdi'.

Prevent the prestart hook from detecting the runtime mode, as using
'cdi' as the default mode while using the NVIDIA legacy runtime is a
valid configuration.

Signed-off-by: Arnaldo Garcia Rincon <[email protected]>
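
For context, the upstream NVIDIA Container Toolkit exposes a related knob; a hedged sketch of the configuration fragment, assuming the `skip-mode-detection` option is the mechanism involved (the commit may instead carry a source patch):

```toml
# Sketch only: tell the prestart hook not to inspect the runtime mode, so
# the legacy hook keeps working even when the default mode is "cdi".
[nvidia-container-runtime-hook]
skip-mode-detection = true
```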
@arnaldo2792 (Contributor) commented:

Last push includes:

  • Add missing "header" to the k8s device plugin configuration
  • Disable mode detection for prestart hook

@arnaldo2792 arnaldo2792 marked this pull request as ready for review May 19, 2025 17:11
Comment on lines +19 to +20
{{~#if (eq settings.kubelet-device-plugins.nvidia.device-list-strategy.[0] "cdi-cri") ~}}
mode="cdi"

Is it expected that "cdi-cri" will always be the first item of the array? What is the behavior if we have multiple items in the array, and what if "cdi-cri" is the second item in the array?

May not be blocking. But it may be good to implement some helper like "has/contains" - similar to what they did in the device plugin - https://github.com/NVIDIA/k8s-device-plugin/blob/6f41f70c43f8da1357f51f64cf60431acc74141f/deployments/helm/nvidia-device-plugin/templates/_helpers.tpl#L178.

Also a note here: I was going to raise an index-out-of-bounds concern, but it looks like the `if` helper checks for an empty list and treats it as false, so using "[0]" is safe here.

A contributor replied:

Regarding the conversation about a custom helper: we had one, but we were advised not to add it, at least for this case:

#502

Comment on lines +21 to +33
{{~else~}}
mode="legacy"
{{~/if~}}
{{~else~}}
{{~#if (eq settings.kubelet-device-plugins.nvidia.device-list-strategy "cdi-cri") ~}}
mode="cdi"
{{~else~}}
mode="legacy"
{{~/if~}}
{{/if}}
{{else}}
mode="legacy"
{{/if}}

I don't love the nested if-else blocks here and the multiple repeated `mode="legacy"` lines.

It looks like the only case where we want to set "cdi" is when the device-list-strategy is set to "cdi-cri" (in string or list form). Could we clean up the if-else logic here?

On another note, along with my comment above, it might be worth considering a dedicated "device-list-strategy" helper? The logic would be simpler in Rust code.

Our custom helpers - https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/schnauzer/src/v2/import/helpers.rs#L36-L88
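
For illustration only, the consolidation being suggested might look like the sketch below, assuming a hypothetical `contains`-style helper that matches both the string and array forms. No such helper exists in schnauzer (a custom helper was considered and declined in #502), and note the merged template keys off the first array element rather than membership:

```handlebars
{{!-- Hypothetical: `contains` is an assumed helper, shown only to
      illustrate the suggested simplification. --}}
{{#if (contains settings.kubelet-device-plugins.nvidia.device-list-strategy "cdi-cri")}}
mode="cdi"
{{else}}
mode="legacy"
{{/if}}
```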

@arnaldo2792 arnaldo2792 changed the title from "Packages: add support for CDI in k8s" to "Support both CDI and Legacy NVIDIA Container Runtime modes" May 19, 2025
@arnaldo2792 arnaldo2792 merged commit d0ea56c into bottlerocket-os:develop May 19, 2025
2 checks passed