nvidia-device-plugin-ctr crashes with Helm deployment with 0.18.0 (works with 0.17.4) #1469

@rwlove

The log from the failing container ends with "error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy", yet both the working (0.17.4) and failing (0.18.0) versions print "deviceDiscoveryStrategy": "auto" in their config.

The only difference I notice between the two configs printed in the logs is that the failing one includes "gdrcopyEnabled": false, which the working one does not have, but I don't see anything in the logs that indicates this is the problem.

container log from 0.18.0 (crashes)

I1021 20:00:21.023400     234 main.go:239] "Starting NVIDIA Device Plugin" version=<
        3c9ffca9
        commit: 3c9ffca9491f0d2d362a7064138dfcd71bb57592
 >
I1021 20:00:21.023540     234 main.go:242] Starting FS watcher for /var/lib/kubelet/device-plugins
I1021 20:00:21.023631     234 main.go:249] Starting OS watcher.
I1021 20:00:21.024275     234 main.go:264] Starting Plugins.
I1021 20:00:21.024300     234 main.go:321] Loading configuration.
I1021 20:00:21.025862     234 main.go:346] Updating config with default resource matching patterns.
I1021 20:00:21.025955     234 main.go:357]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdrcopyEnabled": false,
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 8
        }
      ]
    }
  },
  "imex": {}
}
I1021 20:00:21.025969     234 main.go:360] Retrieving plugins.
E1021 20:00:21.026176     234 factory.go:113] Incompatible strategy detected auto
E1021 20:00:21.026197     234 factory.go:114] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E1021 20:00:21.026207     234 factory.go:115] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E1021 20:00:21.026216     234 factory.go:116] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E1021 20:00:21.026226     234 factory.go:117] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E1021 20:00:21.029575     234 main.go:177] error starting plugins: error getting plugins: unable to create plugins: failed to construct resource managers: invalid device discovery strategy

container log from 0.17.4 (works)

I1021 20:00:57.809002     119 main.go:235] "Starting NVIDIA Device Plugin" version=<
        fd56a747
        commit: fd56a747defe15333adce40fcd3a06ffb129251b
 >
I1021 20:00:57.809163     119 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I1021 20:00:57.809269     119 main.go:245] Starting OS watcher.
I1021 20:00:57.809685     119 main.go:260] Starting Plugins.
I1021 20:00:57.809740     119 main.go:317] Loading configuration.
I1021 20:00:57.813080     119 main.go:342] Updating config with default resource matching patterns.
I1021 20:00:57.813603     119 main.go:353]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 8
        }
      ]
    }
  },
  "imex": {}
}
I1021 20:00:57.813631     119 main.go:356] Retrieving plugins.
I1021 20:00:57.860002     119 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I1021 20:00:57.861888     119 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I1021 20:00:57.865564     119 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet
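
For anyone who wants to repeat the config comparison mechanically instead of by eye, here is a minimal Python sketch. The helper names `flatten` and `diff_configs` are hypothetical, and the two JSON strings are only excerpts of the "flags" sections from the logs above, not the full configs. It only reports which keys differ; it does not explain why 0.18.0 rejects the "auto" strategy.

```python
import json

MISSING = "<absent>"  # sentinel for a key that exists in only one config


def flatten(node, prefix=""):
    """Flatten nested dicts/lists into {dotted.path: leaf_value}."""
    if isinstance(node, dict) and node:
        items = node.items()
    elif isinstance(node, list) and node:
        items = ((str(i), v) for i, v in enumerate(node))
    else:
        # Scalars and empty containers are treated as leaves.
        return {prefix: node}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat


def diff_configs(old, new):
    """Return {path: (old_value, new_value)} for every path that differs."""
    flat_old, flat_new = flatten(old), flatten(new)
    changed = {}
    for path in flat_old.keys() | flat_new.keys():
        before = flat_old.get(path, MISSING)
        after = flat_new.get(path, MISSING)
        if before != after:
            changed[path] = (before, after)
    return changed


# Excerpts of the "flags" section printed by each version (assumption:
# trimmed by hand from the logs; paste the full JSON to compare everything).
cfg_0_17_4 = json.loads(
    '{"flags": {"gdsEnabled": false, "mofedEnabled": false,'
    ' "deviceDiscoveryStrategy": "auto"}}'
)
cfg_0_18_0 = json.loads(
    '{"flags": {"gdrcopyEnabled": false, "gdsEnabled": false,'
    ' "mofedEnabled": false, "deviceDiscoveryStrategy": "auto"}}'
)

diff = diff_configs(cfg_0_17_4, cfg_0_18_0)
for path, (before, after) in sorted(diff.items()):
    print(f"{path}: {before!r} -> {after!r}")
```

With these excerpts the only path reported is flags.gdrcopyEnabled, which matches what I found by reading the two configs side by side.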
