Toggling MIG mode fails when nvidia_drm module is loaded #181

@cdesiniotis

When the nvidia_drm kernel module is loaded, mig-parted fails to perform a GPU reset, and therefore fails to toggle MIG mode on GPUs where a reset is required.
Below is the error message one might encounter:

$ sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config-default.yaml -k /etc/nvidia-mig-manager/hooks-default.yaml -c all-balanced
ERRO[0003]
The following GPUs could not be reset:
  GPU 00000000:01:00.0: In use by another client
  GPU 00000000:47:00.0: In use by another client
  GPU 00000000:81:00.0: In use by another client
  GPU 00000000:C2:00.0: In use by another client

4 devices are currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using these devices and all compute applications running in the system.
FATA[0008] Error applying MIG configuration with hooks: error resetting all GPUs: exit status 255

A workaround is to unload the nvidia_drm kernel module before applying a MIG configuration and load it after the configuration change has been applied.

$ sudo rmmod nvidia_drm
$ sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config-default.yaml -k /etc/nvidia-mig-manager/hooks-default.yaml -c all-balanced
MIG configuration applied successfully
$ sudo modprobe nvidia_drm
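
The workaround above could be wrapped in a small helper so that the module is reloaded only if it was loaded to begin with. This is a minimal sketch, not part of mig-parted: the function names and the RUN and MODULES_FILE variables are hypothetical (RUN=echo gives a dry run), the file paths are the ones from the commands above, and the script is assumed to run as root.

```shell
#!/bin/sh
# Sketch of the workaround: unload nvidia_drm, apply the MIG config, reload.
# Assumptions: run as root; RUN and MODULES_FILE are hypothetical hooks
# (RUN=echo prints the commands instead of executing them).

CONFIG_FILE=/etc/nvidia-mig-manager/config-default.yaml
HOOKS_FILE=/etc/nvidia-mig-manager/hooks-default.yaml

# Succeeds if the named module is listed in /proc/modules
# (one loaded module per line, module name first).
is_module_loaded() {
    grep -q "^$1 " "${MODULES_FILE:-/proc/modules}"
}

# Usage: apply_mig_with_drm_workaround <config-label>
apply_mig_with_drm_workaround() {
    label=$1
    drm_was_loaded=0

    if is_module_loaded nvidia_drm; then
        drm_was_loaded=1
        # Stop here if the module cannot be unloaded (for example, still in use).
        ${RUN:-} rmmod nvidia_drm || return 1
    fi

    ${RUN:-} nvidia-mig-parted apply -f "$CONFIG_FILE" -k "$HOOKS_FILE" -c "$label"
    status=$?

    # Restore the module only if this function unloaded it.
    if [ "$drm_was_loaded" -eq 1 ]; then
        ${RUN:-} modprobe nvidia_drm
    fi
    return "$status"
}
```

After sourcing the file, `apply_mig_with_drm_workaround all-balanced` would perform the same three steps as the manual commands. The helper returns the exit status of nvidia-mig-parted, so a failed apply is still visible to the caller even though the module gets reloaded afterwards.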
