Support both CDI and Legacy NVIDIA Container Runtime modes #459

Merged
Changes from all commits
@@ -1,14 +1,39 @@
[required-extensions]
nvidia-container-runtime = "v1"
std = { version = "v1", helpers = ["default"] }
kubelet-device-plugins = "v1"
std = { version = "v1", helpers = ["default", "is_array"] }

+++
### generated from the template file ###
accept-nvidia-visible-devices-as-volume-mounts = {{default true settings.nvidia-container-runtime.visible-devices-as-volume-mounts}}
accept-nvidia-visible-devices-envvar-when-unprivileged = {{default false settings.nvidia-container-runtime.visible-devices-envvar-when-unprivileged}}

[nvidia-container-cli]
root = "/"
path = "/usr/bin/nvidia-container-cli"
environment = []
ldconfig = "@/sbin/ldconfig"

[nvidia-container-runtime]
{{#if settings.kubelet-device-plugins.nvidia.device-list-strategy}}
{{~#if (is_array settings.kubelet-device-plugins.nvidia.device-list-strategy) ~}}
{{~#if (eq settings.kubelet-device-plugins.nvidia.device-list-strategy.[0] "cdi-cri") ~}}
mode="cdi"
Comment on lines +19 to +20

Contributor
Is it expected that "cdi-cri" will always be the first item of the array? What is the behavior if we have multiple items in the array, and what if "cdi-cri" is the second item in the array?

This may not be blocking, but it may be good to implement a helper like "has"/"contains", similar to what the device plugin does: https://github.com/NVIDIA/k8s-device-plugin/blob/6f41f70c43f8da1357f51f64cf60431acc74141f/deployments/helm/nvidia-device-plugin/templates/_helpers.tpl#L178.

Also a note here: I was going to flag an index-out-of-bounds concern, but it looks like the `if` helper checks for an empty list and treats it as false, so indexing `[0]` is safe here.
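The difference between the template's index-based check and the suggested "contains" semantics can be sketched in plain Rust; neither function below is part of schnauzer, the names are purely illustrative:

```rust
/// Mirrors the template's current check: only element [0] is inspected.
fn first_is_cdi_cri(strategies: &[&str]) -> bool {
    strategies.first() == Some(&"cdi-cri")
}

/// The suggested "has/contains" semantics: matches anywhere in the list.
fn contains_cdi_cri(strategies: &[&str]) -> bool {
    strategies.iter().any(|s| *s == "cdi-cri")
}

fn main() {
    // When "cdi-cri" is not the first item, the index-based check misses it.
    let strategies = ["envvar", "cdi-cri"];
    assert!(!first_is_cdi_cri(&strategies));
    assert!(contains_cdi_cri(&strategies));

    // An empty list is safely false for both, matching the note above
    // about the `if` helper treating an empty list as false.
    let empty: [&str; 0] = [];
    assert!(!first_is_cdi_cri(&empty));
    assert!(!contains_cdi_cri(&empty));
}
```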

Contributor

Regarding the conversation about a custom helper: we had one, but we were advised not to add it, at least for this case:

#502

{{~else~}}
mode="legacy"
{{~/if~}}
{{~else~}}
{{~#if (eq settings.kubelet-device-plugins.nvidia.device-list-strategy "cdi-cri") ~}}
mode="cdi"
{{~else~}}
mode="legacy"
{{~/if~}}
{{/if}}
{{else}}
mode="legacy"
{{/if}}
Comment on lines +21 to +33

Contributor

I don't love the nested if/else blocks here and the multiple repeated `mode="legacy"` lines.

It looks like the only case where we want to set "cdi" is when the device-list-strategy is set to "cdi-cri" (in string or list form). Could we clean up the if-else logic here?

On another note, together with my comment above, it might be worth considering a dedicated "device-list-strategy" helper; the logic would be simpler in Rust code.

Our custom helpers - https://github.com/bottlerocket-os/bottlerocket-core-kit/blob/develop/sources/api/schnauzer/src/v2/import/helpers.rs#L36-L88
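The flattening the comment asks for can be sketched in plain Rust: default to "legacy" and select "cdi" only when the setting, scalar or list, names "cdi-cri". The `Strategy` enum and `runtime_mode` function are illustrative, not the actual schnauzer helper API:

```rust
// Illustrative only: models the device-list-strategy setting, which
// may be a single string or a list of strings, and collapses the
// template's nested if/else into one decision with a single "legacy"
// fallback.
enum Strategy {
    One(String),
    Many(Vec<String>),
}

fn runtime_mode(strategy: Option<&Strategy>) -> &'static str {
    let wants_cdi = match strategy {
        Some(Strategy::One(s)) => s == "cdi-cri",
        // Keeps the template's current first-element semantics;
        // first() on an empty list is None, so this stays safe.
        Some(Strategy::Many(list)) => list.first().map(String::as_str) == Some("cdi-cri"),
        None => false,
    };
    if wants_cdi { "cdi" } else { "legacy" }
}

fn main() {
    assert_eq!(runtime_mode(None), "legacy");
    assert_eq!(runtime_mode(Some(&Strategy::One("cdi-cri".into()))), "cdi");
    assert_eq!(runtime_mode(Some(&Strategy::One("envvar".into()))), "legacy");
    let list = Strategy::Many(vec!["cdi-cri".into(), "envvar".into()]);
    assert_eq!(runtime_mode(Some(&list)), "cdi");
    assert_eq!(runtime_mode(Some(&Strategy::Many(vec![]))), "legacy");
}
```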


[nvidia-container-runtime-hook]
# For the legacy NVIDIA runtime, skip detecting the mode used in the
# NVIDIA Container Runtime. This prevents failures in the legacy NVIDIA runtime
# when the selected mode is 'cdi'.
skip-mode-detection = true
14 changes: 12 additions & 2 deletions packages/nvidia-k8s-device-plugin/nvidia-k8s-device-plugin-conf
@@ -1,6 +1,6 @@
[required-extensions]
kubelet-device-plugins = "v1"
std = { version = "v1", helpers = ["default"] }
std = { version = "v1", helpers = ["default", "join_array", "is_array"] }

+++
version: v1
@@ -15,10 +15,20 @@ flags:
migStrategy: "none"
{{/if}}
failOnInitError: true
nvidiaDriverRoot: "/"
plugin:
passDeviceSpecs: {{default true settings.kubelet-device-plugins.nvidia.pass-device-specs}}
deviceListStrategy: {{default "volume-mounts" settings.kubelet-device-plugins.nvidia.device-list-strategy}}
{{#if settings.kubelet-device-plugins.nvidia.device-list-strategy}}
{{#if (is_array settings.kubelet-device-plugins.nvidia.device-list-strategy)}}
deviceListStrategy: [{{join_array "," settings.kubelet-device-plugins.nvidia.device-list-strategy }}]
{{else}}
deviceListStrategy: {{settings.kubelet-device-plugins.nvidia.device-list-strategy}}
{{/if}}
{{else}}
deviceListStrategy: "volume-mounts"
{{/if}}
deviceIDStrategy: {{default "index" settings.kubelet-device-plugins.nvidia.device-id-strategy}}
containerDriverRoot: "/"
{{#if settings.kubelet-device-plugins.nvidia.device-sharing-strategy}}
{{#if (eq settings.kubelet-device-plugins.nvidia.device-sharing-strategy "time-slicing")}}
sharing: