Failed to collect metrics: could not load NVML library #1

@zh168654

This is my deployment:

apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: nvidia-exporter
  namespace: monitoring
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: nvidia-exporter
    spec:
      containers:
        - name: nvidia-exporter
          securityContext:
            privileged: true
          image: bugroger/nvidia-exporter:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 9401
          volumeMounts:
            - mountPath: /usr/local/nvidia
              name: nvidia
      volumes:
        - name: nvidia
          hostPath:
            path: /home/zy/cuda

When I exec into the nvidia-exporter container and run

ls /usr/local/nvidia/lib64

libnvidia-ml.so.1 is present, but the container logs always show

Failed to collect metrics: could not load NVML library
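
One thing that may be worth checking (a sketch under assumptions, not a confirmed fix): if the exporter loads libnvidia-ml.so.1 by name through the dynamic loader, the loader only searches its configured paths, and /usr/local/nvidia/lib64 is normally not one of them unless LD_LIBRARY_PATH or the ld.so configuration says so. Assuming that is the cause, pointing LD_LIBRARY_PATH at the mounted directory would look roughly like this. The env block is hypothetical and is not part of the deployment above:

```yaml
# Fragment of the pod template's container spec (spec.template.spec.containers).
# Assumption: the exporter resolves libnvidia-ml.so.1 via the dynamic loader's
# search path; this has not been verified against the image.
containers:
  - name: nvidia-exporter
    image: bugroger/nvidia-exporter:latest
    env:
      # Hypothetical addition: add the mounted driver directory to the
      # loader's search path.
      - name: LD_LIBRARY_PATH
        value: /usr/local/nvidia/lib64
    volumeMounts:
      - mountPath: /usr/local/nvidia
        name: nvidia
```

If the same error still appears with this set, the search path is probably not the problem and the library itself (or one of its dependencies) may be failing to load.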
