Description
When the nvidia_drm kernel module is loaded, mig-parted fails to perform a GPU reset, and therefore fails to toggle MIG mode on GPUs where a reset is required.
Below is the error message one might encounter:
$ sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config-default.yaml -k /etc/nvidia-mig-manager/hooks-default.yaml -c all-balanced
ERRO[0003]
The following GPUs could not be reset:
GPU 00000000:01:00.0: In use by another client
GPU 00000000:47:00.0: In use by another client
GPU 00000000:81:00.0: In use by another client
GPU 00000000:C2:00.0: In use by another client
4 devices are currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using these devices and all compute applications running in the system.
FATA[0008] Error applying MIG configuration with hooks: error resetting all GPUs: exit status 255
A workaround is to unload the nvidia_drm kernel module before applying a MIG configuration, and to load it again once the configuration change has been applied.
$ sudo rmmod nvidia_drm
$ sudo nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config-default.yaml -k /etc/nvidia-mig-manager/hooks-default.yaml -c all-balanced
MIG configuration applied successfully
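For anyone who wants to script this, the full sequence (unload, apply, reload) could look roughly like the sketch below. It is only an illustration of the workaround described above: the function names (`run`, `module_loaded`, `apply_mig_config`) and the `DRY_RUN` variable are made up for this example and are not part of mig-parted. It assumes a loaded module shows up under `/sys/module`, and that `modprobe nvidia_drm` is the right way to load the module back on your system. Note that `rmmod` may itself fail if something still holds nvidia_drm (a running display server, for instance); the sketch simply stops in that case.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: unload nvidia_drm, apply the MIG config, reload the module.
# All function names and the DRY_RUN variable are illustrative, not part of mig-parted.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Succeed if the given kernel module is currently loaded
# (assumption: loaded modules appear as directories under /sys/module).
module_loaded() {
    [ -d "/sys/module/$1" ]
}

# Usage: apply_mig_config <config-file> <hooks-file> <config-name>
apply_mig_config() {
    local config_file="$1" hooks_file="$2" config_name="$3"
    local reload_drm=0 status=0

    # Unload nvidia_drm only if it is present; give up if the unload fails.
    if module_loaded nvidia_drm; then
        run sudo rmmod nvidia_drm || return 1
        reload_drm=1
    fi

    run sudo nvidia-mig-parted apply -f "$config_file" -k "$hooks_file" -c "$config_name" || status=$?

    # Reload the module only if this function unloaded it, even when apply failed.
    if [ "$reload_drm" = "1" ]; then
        run sudo modprobe nvidia_drm || status=$?
    fi
    return "$status"
}
```

With the paths from this report, a call would be `apply_mig_config /etc/nvidia-mig-manager/config-default.yaml /etc/nvidia-mig-manager/hooks-default.yaml all-balanced`. Setting `DRY_RUN=1` first prints the commands instead of running them.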