Enhancement of nvidia-device-plugin-daemonset in gpu-cluster.md #138

Open · wants to merge 2 commits into base: main

Changes from 1 commit
10 changes: 8 additions & 2 deletions articles/aks/gpu-cluster.md
@@ -138,7 +138,7 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
kind: DaemonSet
metadata:
name: nvidia-device-plugin-daemonset
-  namespace: kube-system
+  namespace: gpu-operator


Is this change necessary at the same time? Just wary of this tripping people up but I think it should be ok...

@JoeyC-Dev (Contributor, Author) commented on Mar 10, 2025
In conclusion: yes.
First things first, the original tutorial already asks the user to create the gpu-operator namespace.
I can tell where that gpu-operator namespace comes from: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html
[screenshot]

Although k8s-device-plugin and gpu-operator are different things, I still think the plugin should have its own namespace. For example, istio has its own aks-istio-system, and approuting has app-routing-system. So why not give NVIDIA/k8s-device-plugin a separate one?

The second and most important part: as far as I remember, when I install a third-party plugin in kube-system, it force-injects additional environment variables (which are unnecessary). For example:
[screenshot]

I'm not satisfied with that.

Ever since I realized that Pods in kube-system are treated differently from others, I have stopped suggesting installing Pods in kube-system when it is not necessary. So I did not edit the document that way.


Makes sense

Contributor
Thanks for the feedback. To align with our Windows GPU guidance, the namespace for the k8s device plugin should be updated to gpu-resources.
@JoeyC-Dev, can you please update line 131 to `kubectl create namespace gpu-resources`?
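With that change applied, the namespace-creation step and the plugin rollout would look roughly like this (a sketch, assuming the manifest is saved locally as nvidia-device-plugin-ds.yaml — a hypothetical filename — with its metadata.namespace already set to gpu-resources, and assuming the upstream manifest's `name: nvidia-device-plugin-ds` pod label):

```shell
# Create the dedicated namespace for the device plugin (replaces the
# earlier `kubectl create namespace gpu-operator` step in the article).
kubectl create namespace gpu-resources

# Apply the DaemonSet; its manifest pins metadata.namespace to gpu-resources.
kubectl apply -f nvidia-device-plugin-ds.yaml

# Verify the plugin pods are running on the GPU nodes.
kubectl get pods -n gpu-resources -l name=nvidia-device-plugin-ds
```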

Contributor
Suggested change
-  namespace: gpu-operator
+  namespace: gpu-resources

spec:
selector:
matchLabels:
@@ -155,13 +155,19 @@ To use Azure Linux, you specify the OS SKU by setting `os-sku` to `AzureLinux` d
operator: "Equal"
value: "gpu"
effect: "NoSchedule"
+ - key: "kubernetes.azure.com/scalesetpriority"
+   operator: "Equal"
+   value: "spot"
+   effect: "NoSchedule"
nodeSelector:
kubernetes.azure.com/accelerator: nvidia
# Mark this pod as a critical add-on; when enabled, the critical add-on
# scheduler reserves resources for critical add-on pods so that they can
# be rescheduled after a failure.
# See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
priorityClassName: "system-node-critical"
containers:
- - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
+ - image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0
name: nvidia-device-plugin-ctr
env:
- name: FAIL_ON_INIT_ERROR
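The toleration added in this diff matters because AKS taints spot node pools with kubernetes.azure.com/scalesetpriority=spot:NoSchedule, and a pod is only scheduled onto a node if it tolerates every NoSchedule taint on it. A minimal sketch of that matching rule for `operator: "Equal"` tolerations (illustrative only, not the real scheduler code):

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Simplified match rule: an "Equal" toleration matches a taint when
    key, value, and effect all agree."""
    return (
        toleration.get("operator") == "Equal"
        and toleration.get("key") == taint["key"]
        and toleration.get("value") == taint["value"]
        and toleration.get("effect") == taint["effect"]
    )

# The taint AKS places on spot node pools.
spot_taint = {
    "key": "kubernetes.azure.com/scalesetpriority",
    "value": "spot",
    "effect": "NoSchedule",
}

# The toleration added to the DaemonSet in this PR.
spot_toleration = {
    "key": "kubernetes.azure.com/scalesetpriority",
    "operator": "Equal",
    "value": "spot",
    "effect": "NoSchedule",
}

print(tolerates(spot_toleration, spot_taint))  # → True
```

Without this toleration the device plugin would never land on spot GPU nodes, so their GPUs would stay unadvertised to the scheduler.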