4.5.2-4.8.1 fails to start on node with A30 (MIG enabled) and RTX2000 #657

@gfrankliu

What is the version?

4.5.2-4.8.1

What happened?

I have a single-node k3s cluster with two A30 GPUs (MIG enabled) and one RTX2000 (for nvcodec). nvcr.io/nvidia/k8s/dcgm-exporter:3.1.6-3.1.3-ubuntu20.04 has been working fine with the CSV file below:

    # Memory usage
    DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
    DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

    # DCP metrics,,
    DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
    DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
    DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
    DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active (in %).
    DCGM_FI_PROF_DRAM_ACTIVE,        gauge, Ratio of cycles the device memory interface is active sending or receiving data (in %).
    DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
    DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active (in %).
    DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active (in %).
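Each non-comment line in the file above has three comma-separated columns: the DCGM field name, the metric type, and the help text. As a minimal sketch (the helper names `parse_metrics_csv`, `split_by_profiling`, and `single_field_files` are hypothetical, not part of dcgm-exporter), the file can be parsed and split into smaller files so that groups of fields, or individual fields, can be tried one at a time. This assumes a per-field test is useful for the setup at hand; nothing in the file itself says which fields a given GPU supports.

```python
SAMPLE = """\
# Memory usage
DCGM_FI_DEV_FB_FREE, gauge, Framebuffer memory free (in MiB).
DCGM_FI_DEV_FB_USED, gauge, Framebuffer memory used (in MiB).

# DCP metrics,,
DCGM_FI_PROF_GR_ENGINE_ACTIVE,   gauge, Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_PIPE_FP64_ACTIVE,   gauge, Ratio of cycles the fp64 pipes are active (in %).
"""


def parse_metrics_csv(text):
    """Return a list of (field, metric_type, help_text) tuples.

    Blank lines and lines starting with '#' are skipped.
    """
    metrics = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first two commas only, since help text may contain commas.
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            raise ValueError(f"expected 3 columns, got: {raw!r}")
        metrics.append(tuple(parts))
    return metrics


def split_by_profiling(metrics):
    """Separate DCGM_FI_PROF_* fields from all other fields."""
    prof = [m for m in metrics if m[0].startswith("DCGM_FI_PROF_")]
    other = [m for m in metrics if not m[0].startswith("DCGM_FI_PROF_")]
    return prof, other


def single_field_files(metrics):
    """Map each field name to the content of a one-line CSV file for it."""
    return {field: f"{field}, {mtype}, {text}\n" for field, mtype, text in metrics}


if __name__ == "__main__":
    parsed = parse_metrics_csv(SAMPLE)
    prof, other = split_by_profiling(parsed)
    print("profiling fields:", [m[0] for m in prof])
    print("other fields:", [m[0] for m in other])
```

Each value returned by `single_field_files` can be written to its own file and mounted in place of the full CSV to test one field at a time.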

Recently we upgraded the driver from R550 to R580, so we need to upgrade dcgm-exporter. We chose nvcr.io/nvidia/k8s/dcgm-exporter:4.5.2-4.8.1-ubuntu22.04, but the pod fails to come up. Below is the error from the pod:

    time=2026-05-09T18:40:22.491Z level=INFO msg="Starting dcgm-exporter" Version=4.5.2-4.8.1
    time=2026-05-09T18:40:22.503Z level=INFO msg="Attempting to initialize DCGM."
    time=2026-05-09T18:40:22.922Z level=INFO msg="Initialized DCGM Fields module."
    time=2026-05-09T18:40:22.922Z level=INFO msg="Attempting to initialize NVML library."
    time=2026-05-09T18:40:22.922Z level=INFO msg="NVML provider successfully initialized for Kubernetes MIG support"
    time=2026-05-09T18:40:22.922Z level=INFO msg="DCGM successfully initialized!"
    time=2026-05-09T18:40:23.266Z level=INFO msg="Successfully queried DCGM profiling metric groups" reload_id=0 count=7 gpu_model="NVIDIA A30"
    time=2026-05-09T18:40:23.266Z level=INFO msg="Building registry for current GPU topology"
    time=2026-05-09T18:40:23.266Z level=INFO msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-custom-metrics.csv'"
    time=2026-05-09T18:40:23.266Z level=INFO msg="Initializing system entities of type 'GPU'"
    time=2026-05-09T18:40:23.320Z level=INFO msg="Initializing system entities of type 'NvSwitch'"
    time=2026-05-09T18:40:23.320Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
    time=2026-05-09T18:40:23.320Z level=INFO msg="Initializing system entities of type 'NvLink'"
    time=2026-05-09T18:40:23.320Z level=WARN msg="Failed to initialize NvSwitch/NvLink info" error="no switches to monitor"
    time=2026-05-09T18:40:23.336Z level=INFO msg="Initializing system entities of type 'CPU'"
    time=2026-05-09T18:40:24.034Z level=INFO msg="Not collecting CPU metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
    time=2026-05-09T18:40:24.034Z level=INFO msg="Initializing system entities of type 'CPU Core'"
    time=2026-05-09T18:40:24.034Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
    time=2026-05-09T18:40:24.174Z level=ERROR msg="DCGM collector for entity type 'GPU' cannot be initialized; err: error watching fields: Feature not supported"
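The log above uses `key=value` pairs with optionally quoted values. As a rough sketch for pulling only the WARN and ERROR entries out of a longer pod log, the following parses each line into a dictionary and filters by level. The function names `parse_line` and `problems` are hypothetical, and the regular expression assumes the quoting style seen in this log; escaped quotes inside values are matched but not unescaped.

```python
import re

# Matches key=value where value is either a double-quoted string or a bare token.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

SAMPLE_LOG = """\
time=2026-05-09T18:40:23.320Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
time=2026-05-09T18:40:23.320Z level=WARN msg="Failed to initialize NvSwitch/NvLink info" error="no switches to monitor"
time=2026-05-09T18:40:24.174Z level=ERROR msg="DCGM collector for entity type 'GPU' cannot be initialized; err: error watching fields: Feature not supported"
"""


def parse_line(line):
    """Turn one log line into a dict of its key=value pairs."""
    fields = {}
    for key, value in PAIR.findall(line):
        # Strip the surrounding double quotes from quoted values.
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def problems(log_text, levels=("WARN", "ERROR")):
    """Return parsed entries whose level is in `levels`, in log order."""
    entries = (parse_line(line) for line in log_text.splitlines())
    return [e for e in entries if e.get("level") in levels]


if __name__ == "__main__":
    for entry in problems(SAMPLE_LOG):
        print(entry["time"], entry["level"], entry["msg"])
```

In this log the only ERROR entry is the final line, where watching the GPU fields fails with "Feature not supported".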

What did you expect to happen?

The dcgm-exporter pod should come up.

What is the GPU model?

A30 (MIG enabled) and RTX2000

What is the environment?

single node k3s

How did you deploy the dcgm-exporter and what is the configuration?

No response

How to reproduce the issue?

No response

Anything else we need to know?

No response

    Labels

    bug: Something isn't working
