-
Notifications
You must be signed in to change notification settings - Fork 754
enable device plugin specific node selector #1467
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: omrishiv <[email protected]>
|
|
||
| devicePlugin: | ||
| enabled: true | ||
| nodeSelector: {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My one question would be whether we wanted to do this for other top-level placement selectors such as affinity and also do the same for gfd?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gfd is useful for all nvidia nodes, so I didn't see a need to extend it, but I'm happy to add it so it's usable if you want it
| {{- $useServiceAccount := $options.hasConfigMap }} | ||
| {{- $configMapName := (include "nvidia-device-plugin.configMapName" .) | trim }} | ||
| {{- $daemonsetName := printf "%s" (include "nvidia-device-plugin.fullname" .) | trunc 63 | trimSuffix "-" }} | ||
| {{- $devicePluginNodeSelector := merge .Values.devicePlugin.nodeSelector .Values.nodeSelector }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is merging what we want here? Should we not use devicePlugin.nodeSelector if specified and fallback to the top-level nodeSelector if it is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me give a better example of what we have:
| - | node 1 | node 2 | node 3 |
|---|---|---|---|
| accelerator | nvidia | nvidia | aws |
| ami | al2023 | bottlerocket | bottlerocket |
- If we don't have any nodeselectors, we get pods scheduled on all nodes, but we don't need it on nodes 2 and 3
- If we have nodeSelector for accelerator: nvidia, we get it on 1 and 2, but we don't need it on 2.
We do need it on 1, but we don't want it to schedule on any other node. This attempts to maintain backwards compatibility while filtering out nodes that don't need it. The idea is having a global selector (accelerator) and filtering using a device plugin specific one (ami)
This PR enables device plugin specific node selectors. The idea is to retain the global functionality of
.Values.nodeSelector, but allow merging device plugin specific node selectors. Bottlerocket does not need the device plugin, but benefits from the rest of daemonsets. Right now, deploying as-is results in the device plugin constantly crashing on Bottlerocket nodes.