Description
The failing log reports "error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy", yet both the working and the failing versions print "deviceDiscoveryStrategy": "auto" in their config.
The only difference I can see between the two configs printed in the logs is that the failing one includes "gdrcopyEnabled": false, which the working one does not print at all. Nothing in the logs indicates that this is the problem.
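For anyone who wants to double-check the comparison, here is a minimal sketch of how the two printed configs can be diffed key by key. The helper names (`flatten`, `diff_configs`) and the `<absent>` marker are mine, not part of the plugin, and the two dictionaries are trimmed to a few of the flags from the logs below. Paste the full JSON through `json.loads` to compare everything.

```python
import json

ABSENT = "<absent>"  # hypothetical marker for a key that one config does not have


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into a {"dotted.path": value} mapping."""
    if isinstance(obj, dict) and obj:
        items = {}
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}{key}."))
        return items
    if isinstance(obj, list) and obj:
        items = {}
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}{index}."))
        return items
    # Scalars and empty containers are treated as leaves.
    return {prefix.rstrip("."): obj}


def diff_configs(old, new):
    """Return {path: (old_value, new_value)} for every path that differs."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path, ABSENT), b.get(path, ABSENT))
        for path in sorted(a.keys() | b.keys())
        if a.get(path, ABSENT) != b.get(path, ABSENT)
    }


# Trimmed excerpts of the configs printed by each version.
CFG_0_17_4 = json.loads(
    '{"flags": {"migStrategy": "none", "gdsEnabled": false,'
    ' "deviceDiscoveryStrategy": "auto"}, "imex": {}}'
)
CFG_0_18_0 = json.loads(
    '{"flags": {"migStrategy": "none", "gdrcopyEnabled": false,'
    ' "gdsEnabled": false, "deviceDiscoveryStrategy": "auto"}, "imex": {}}'
)

if __name__ == "__main__":
    for path, (old, new) in diff_configs(CFG_0_17_4, CFG_0_18_0).items():
        print(f"{path}: {old!r} -> {new!r}")
```

On these excerpts the only differing path is flags.gdrcopyEnabled, which matches what I see by eye.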
Container log from 0.18.0 (crashes)
I1021 20:00:21.023400 234 main.go:239] "Starting NVIDIA Device Plugin" version=<
3c9ffca9
commit: 3c9ffca9491f0d2d362a7064138dfcd71bb57592
>
I1021 20:00:21.023540 234 main.go:242] Starting FS watcher for /var/lib/kubelet/device-plugins
I1021 20:00:21.023631 234 main.go:249] Starting OS watcher.
I1021 20:00:21.024275 234 main.go:264] Starting Plugins.
I1021 20:00:21.024300 234 main.go:321] Loading configuration.
I1021 20:00:21.025862 234 main.go:346] Updating config with default resource matching patterns.
I1021 20:00:21.025955 234 main.go:357]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": false,
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 8
        }
      ]
    }
  },
  "imex": {}
}
I1021 20:00:21.025969 234 main.go:360] Retrieving plugins.
E1021 20:00:21.026176 234 factory.go:113] Incompatible strategy detected auto
E1021 20:00:21.026197 234 factory.go:114] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1021 20:00:21.026207 234 factory.go:115] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1021 20:00:21.026216 234 factory.go:116] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1021 20:00:21.026226 234 factory.go:117] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E1021 20:00:21.029575 234 main.go:177] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy
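To pull just the failure path out of a longer log, the lines can be filtered by their klog severity prefix (I, W, E, F), which is followed by the date, time, thread id and file:line. The following is a rough sketch and not anything shipped with the plugin: the regular expression and the `error_entries` helper are my own, and the sample is a shortened copy of the 0.18.0 log above.

```python
import re

# klog header layout: Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<thread>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] ?"
    r"(?P<message>.*)$"
)


def error_entries(log_text):
    """Return (file, line, message) for every error or fatal klog line."""
    entries = []
    for raw in log_text.splitlines():
        match = KLOG_LINE.match(raw.strip())
        # Lines without a klog header (e.g. the printed JSON config) are skipped.
        if match and match["severity"] in ("E", "F"):
            entries.append((match["file"], int(match["line"]), match["message"]))
    return entries


# Shortened excerpt of the 0.18.0 container log.
SAMPLE = """\
I1021 20:00:21.025969 234 main.go:360] Retrieving plugins.
E1021 20:00:21.026176 234 factory.go:113] Incompatible strategy detected auto
E1021 20:00:21.026197 234 factory.go:114] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1021 20:00:21.029575 234 main.go:177] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy
"""

if __name__ == "__main__":
    for file, line, message in error_entries(SAMPLE):
        print(f"{file}:{line}: {message}")
```

On this excerpt the first error entry comes from factory.go:113 ("Incompatible strategy detected auto"), a few milliseconds before main.go:177 reports the "invalid device discovery strategy" failure.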
Container log from 0.17.4 (working)
I1021 20:00:57.809002 119 main.go:235] "Starting NVIDIA Device Plugin" version=<
fd56a747
commit: fd56a747defe15333adce40fcd3a06ffb129251b
>
I1021 20:00:57.809163 119 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1021 20:00:57.809269 119 main.go:245] Starting OS watcher.
I1021 20:00:57.809685 119 main.go:260] Starting Plugins.
I1021 20:00:57.809740 119 main.go:317] Loading configuration.
I1021 20:00:57.813080 119 main.go:342] Updating config with default resource matching patterns.
I1021 20:00:57.813603 119 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 8
        }
      ]
    }
  },
  "imex": {}
}
I1021 20:00:57.813631 119 main.go:356] Retrieving plugins.
I1021 20:00:57.860002 119 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I1021 20:00:57.861888 119 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1021 20:00:57.865564 119 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet