Description
When running dcgmi diag -r 1 while snappy-debug is monitoring, the following denial appears:
INFO: Following '/var/log/syslog'. If have dropped messages, use:
INFO: $ sudo journalctl --output=short --follow --all | sudo snappy-debug
= AppArmor =
Time: Sep 9 21:43:32
Log: apparmor="DENIED" operation="connect" profile="snap.dcgm.nv-hostengine" name="/run/nvidia-persistenced/socket" pid=66001 comm="nv-hostengine" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=114
File: /run/nvidia-persistenced/socket (write)
Suggestions:
* adjust program to use $SNAP_DATA
* adjust program to use /run/shm/snap.$SNAP_NAME.*
* adjust program to use /run/snap.$SNAP_NAME.*
* adjust snap to use snap layouts (https://forum.snapcraft.io/t/snap-layouts/7207)
Currently no snap interface grants access to this socket.
The following denials were also detected:
= AppArmor =
Time: Sep 15 19:23:56
Log: apparmor="DENIED" operation="unlink" profile="snap.dcgm.nv-hostengine" name="/dev/char/195:0" pid=163107 comm="nv-hostengine" requested_mask="d" denied_mask="d" fsuid=0 ouid=0
File: /dev/char/195:0 (write)
= AppArmor =
Time: Sep 15 19:23:56
Log: apparmor="DENIED" operation="unlink" profile="snap.dcgm.nv-hostengine" name="/dev/char/195:1" pid=163107 comm="nv-hostengine" requested_mask="d" denied_mask="d" fsuid=0 ouid=0
File: /dev/char/195:1 (write)
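The denial lines above share a key=value layout, so the relevant fields (operation, name, denied_mask) can be pulled out programmatically when comparing many denials. A minimal sketch, assuming lines shaped like the ones quoted in this report; parse_apparmor_denial is a hypothetical helper, not part of snappy-debug or AppArmor tooling:

```python
import re
from typing import Dict


def parse_apparmor_denial(line: str) -> Dict[str, str]:
    """Collect the key=value pairs of one AppArmor audit line into a dict.

    Hypothetical helper for illustration only.
    """
    fields = {}
    # Values are either double-quoted strings or bare tokens such as pid=66001.
    for key, quoted, bare in re.findall(r'(\w+)=(?:"([^"]*)"|(\S+))', line):
        fields[key] = quoted or bare
    return fields
```

Applied to the first denial in this report, the returned dict maps operation to connect and name to /run/nvidia-persistenced/socket. All values are kept as strings.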
From what I investigated about /dev/char/195:1:
- Major number 195 is reserved for NVIDIA devices.
- Minor 1 → /dev/nvidia1, the device node of the second GPU (minor 0 is /dev/nvidia0). The control device that all NVIDIA userspace components talk to, /dev/nvidiactl, uses minor 255.
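To make the major:minor reading concrete, here is a small sketch that maps a /dev/char/195:N path from a denial to the device node it usually points at. It assumes the conventional NVIDIA driver numbering (minor N is /dev/nvidiaN, minor 254 is /dev/nvidia-modeset, minor 255 is /dev/nvidiactl); nvidia_node_for is a hypothetical name, and the mapping should be checked against ls -l /dev/nvidia* on the affected machine:

```python
NVIDIA_MAJOR = 195  # character-device major number used by the NVIDIA driver

# Assumed conventional special minors; everything else is treated as a GPU index.
SPECIAL_MINORS = {254: "/dev/nvidia-modeset", 255: "/dev/nvidiactl"}


def nvidia_node_for(char_path: str) -> str:
    """Translate e.g. '/dev/char/195:1' into the device node it normally links to."""
    major_text, minor_text = char_path.rsplit("/", 1)[-1].split(":")
    major, minor = int(major_text), int(minor_text)
    if major != NVIDIA_MAJOR:
        raise ValueError(f"{char_path} is not an NVIDIA (major 195) device")
    return SPECIAL_MINORS.get(minor, f"/dev/nvidia{minor}")
```

Under these assumptions the two denied paths, /dev/char/195:0 and /dev/char/195:1, correspond to the first two GPU device nodes.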
Adding the nvidia-drivers-support interface to the snap solves this issue, but the description of the interface says it:
is for internal Ubuntu Core use only
alfonsosanchezbeato said this:
I think the extra devices/socket should be added to the opengl interface, as we have done in the past (yes, it is a weird interface to use but there is where you put all GPU permissions). nvidia-drivers-support was meant as a way to be able to assemble nvidia kernel drivers.
Even after enabling persistence mode, the output of dcgmi diag -r 1 looks like this:
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 570.172.08                                     |
| GPU Device IDs Detected   | 26b5,26b5,26b5,26b5,26b5,26b5,26b5,26b5        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Info                      | Persistence mode for GPU 0 is disabled. Enabl  |
|                           | e persistence mode by running "nvidia-smi -i   |
|                           | <gpuId> -pm 1 " as root.,Persistence mode for  |
|                           |  GPU 1 is disabled. Enable persistence mode b  |
|                           | y running "nvidia-smi -i <gpuId> -pm 1 " as r  |
|                           | oot.,Persistence mode for GPU 2 is disabled.   |
|                           | Enable persistence mode by running "nvidia-sm  |
|                           | i -i <gpuId> -pm 1 " as root.,Persistence mod  |
|                           | e for GPU 3 is disabled. Enable persistence m  |
|                           | ode by running "nvidia-smi -i <gpuId> -pm 1 "  |
|                           |  as root.,Persistence mode for GPU 4 is disab  |
|                           | led. Enable pers                               |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+
Snap version
3.3.8+snap-45c9cd4857
Expected behaviour
After enabling persistence mode for a GPU with nvidia-smi -i <gpuId> -pm 1, running dcgmi diag -r 1 should no longer show an Info message asking to enable it.
Something like:
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 570.172.08                                     |
| GPU Device IDs Detected   | 26b5,26b5,26b5,26b5,26b5,26b5,26b5,26b5        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+
Reproduce / Test
Install dcgm on a machine with NVIDIA GPUs, install a driver version >= 525 and < 580 (CUDA 12 compatible), and run dcgmi diag -r 1.
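When repeating the test on several machines, it can help to check the diag output automatically instead of reading the wrapped Info cell by eye. A rough sketch, assuming the table layout shown in this report (two columns separated by |, with the Info text wrapped mid-word across rows); disabled_persistence_gpus is a hypothetical helper, not a dcgmi feature:

```python
import re
from typing import List


def disabled_persistence_gpus(diag_output: str) -> List[int]:
    """Return the GPU indices that the diag table reports as persistence-disabled."""
    cells = []
    for line in diag_output.splitlines():
        parts = line.strip().split("|")
        # A data row "| name | value |" splits into: '', name, value, ''.
        if len(parts) != 4:
            continue
        cells.append(parts[2])
    # The message is wrapped mid-word, so drop all whitespace before matching.
    squashed = re.sub(r"\s+", "", "".join(cells))
    found = re.findall(r"PersistencemodeforGPU(\d+)isdisabled", squashed)
    return sorted({int(index) for index in found})
```

An empty list means no such Info message was found, which is the expected behaviour described above once persistence mode is enabled. The text to feed in could be captured with, for example, subprocess.run(["dcgmi", "diag", "-r", "1"], capture_output=True, text=True).stdout.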
Notes & References
No response