What is the version?
4.5.2-4.8.1
What happened?
I have a single-node k3s cluster with two A30 GPUs (MIG enabled) and one RTX2000 (used for nvcodec). nvcr.io/nvidia/k8s/dcgm-exporter:3.1.6-3.1.3-ubuntu20.04 has been working fine with the CSV file below:
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).
# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE, gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_ACTIVE, gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY, gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
DCGM_FI_PROF_DRAM_ACTIVE, gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE, gauge, Ratio of cycles the fp64 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP32_ACTIVE, gauge, Ratio of cycles the fp32 pipes are active (in %).
DCGM_FI_PROF_PIPE_FP16_ACTIVE, gauge, Ratio of cycles the fp16 pipes are active (in %).
We recently upgraded the driver from R550 to R580, so we need to upgrade dcgm-exporter as well. We chose nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-ubuntu22.04, but the pod fails to come up. Below are the logs from the pod:
time=2026-05-09T18:40:22.491Z level=INFO msg="Starting dcgm-exporter" Version=4.5.2-4.8.1
time=2026-05-09T18:40:22.503Z level=INFO msg="Attempting to initialize DCGM."
time=2026-05-09T18:40:22.922Z level=INFO msg="Initialized DCGM Fields module."
time=2026-05-09T18:40:22.922Z level=INFO msg="Attempting to initialize NVML library."
time=2026-05-09T18:40:22.922Z level=INFO msg="NVML provider successfully initialized for Kubernetes MIG support"
time=2026-05-09T18:40:22.922Z level=INFO msg="DCGM successfully initialized!"
time=2026-05-09T18:40:23.266Z level=INFO msg="Successfully queried DCGM profiling metric groups" reload_id=0 count=7 gpu_model="NVIDIA A30"
time=2026-05-09T18:40:23.266Z level=INFO msg="Building registry for current GPU topology"
time=2026-05-09T18:40:23.266Z level=INFO msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-custom-metrics.csv'"
time=2026-05-09T18:40:23.266Z level=INFO msg="Initializing system entities of type 'GPU'"
time=2026-05-09T18:40:23.320Z level=INFO msg="Initializing system entities of type 'NvSwitch'"
time=2026-05-09T18:40:23.320Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
time=2026-05-09T18:40:23.320Z level=INFO msg="Initializing system entities of type 'NvLink'"
time=2026-05-09T18:40:23.320Z level=WARN msg="Failed to initialize NvSwitch/NvLink info" error="no switches to monitor"
time=2026-05-09T18:40:23.336Z level=INFO msg="Initializing system entities of type 'CPU'"
time=2026-05-09T18:40:24.034Z level=INFO msg="Not collecting CPU metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2026-05-09T18:40:24.034Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2026-05-09T18:40:24.034Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2026-05-09T18:40:24.174Z level=ERROR msg="DCGM collector for entity type 'GPU' cannot be initialized; err: error watching fields: Feature not supported"
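One way to narrow down which watched fields trigger "Feature not supported" is to retry with a reduced metrics file. The sketch below assumes (unconfirmed) that the DCP profiling fields (`DCGM_FI_PROF_*`) are the culprit, for example because they may not be supported on the RTX2000 or on the MIG instances with this driver/DCGM combination. It splits the CSV above into basic and profiling fields and prints a basic-only CSV. The helper names `split_metrics` and `to_csv` are hypothetical and not part of dcgm-exporter.

```python
"""Hypothetical triage helper: split a dcgm-exporter metrics CSV into
basic fields and DCP profiling fields (DCGM_FI_PROF_*)."""

import csv
import io
import sys


def split_metrics(csv_text):
    """Return (basic_rows, prof_rows); each row is [field, type, help]."""
    basic, prof = [], []
    for row in csv.reader(io.StringIO(csv_text)):
        # Skip blank lines and comment lines such as "# DCP metrics,,"
        if not row or row[0].strip().startswith("#"):
            continue
        cols = [col.strip() for col in row]
        if len(cols) < 3:
            raise ValueError(f"expected 'field, type, help', got: {row}")
        # Re-join help text in case it contained commas
        entry = [cols[0], cols[1], ", ".join(cols[2:])]
        if entry[0].startswith("DCGM_FI_PROF_"):
            prof.append(entry)
        else:
            basic.append(entry)
    return basic, prof


def to_csv(rows):
    """Render rows back into the 'field, type, help' line format."""
    return "".join(f"{name}, {mtype}, {help_text}\n" for name, mtype, help_text in rows)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python split_metrics.py dcgm-custom-metrics.csv > basic-only.csv
    with open(sys.argv[1], encoding="utf-8") as fh:
        basic_rows, prof_rows = split_metrics(fh.read())
    sys.stdout.write(to_csv(basic_rows))
    print(f"# dropped {len(prof_rows)} DCGM_FI_PROF_* fields", file=sys.stderr)
```

If the pod starts with the basic-only file mounted in place of the current custom CSV, the profiling fields are implicated and can be re-added one at a time to find the unsupported one.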
What did you expect to happen?
The dcgm-exporter pod should come up and start exporting metrics.
What is the GPU model?
Two NVIDIA A30 (MIG enabled) and one RTX2000
What is the environment?
Single-node k3s
How did you deploy the dcgm-exporter and what is the configuration?
No response
How to reproduce the issue?
No response
Anything else we need to know?
No response