Description
When running dcgmi diag -r 1 while snappy-debug is following the logs, the following denial appears:
INFO: Following '/var/log/syslog'. If have dropped messages, use:
INFO: $ sudo journalctl --output=short --follow --all | sudo snappy-debug
= AppArmor =
Time: Sep  9 21:43:32
Log: apparmor="DENIED" operation="connect" profile="snap.dcgm.nv-hostengine" name="/run/nvidia-persistenced/socket" pid=66001 comm="nv-hostengine" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=114
File: /run/nvidia-persistenced/socket (write)
Suggestions:
* adjust program to use $SNAP_DATA
* adjust program to use /run/shm/snap.$SNAP_NAME.*
* adjust program to use /run/snap.$SNAP_NAME.*
* adjust snap to use snap layouts (https://forum.snapcraft.io/t/snap-layouts/7207)
Currently there is no interface for this.
These denials were also detected:
= AppArmor =
Time: Sep 15 19:23:56
Log: apparmor="DENIED" operation="unlink" profile="snap.dcgm.nv-hostengine" name="/dev/char/195:0" pid=163107 comm="nv-hostengine" requested_mask="d" denied_mask="d" fsuid=0 ouid=0
File: /dev/char/195:0 (write)
= AppArmor =
Time: Sep 15 19:23:56
Log: apparmor="DENIED" operation="unlink" profile="snap.dcgm.nv-hostengine" name="/dev/char/195:1" pid=163107 comm="nv-hostengine" requested_mask="d" denied_mask="d" fsuid=0 ouid=0
File: /dev/char/195:1 (write)
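To compare denials like the ones above across runs, it can help to pull the fields out of each log line. The following is a minimal sketch, not part of snappy-debug: parse_denial is a hypothetical helper that only assumes the key="value" / key=value layout visible in the log lines quoted in this report.

```python
import re

# Matches key="quoted value" and key=bare_value pairs in an AppArmor log line.
_PAIR = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_denial(line):
    """Return the key/value fields of one AppArmor log line as a dict of strings."""
    fields = {}
    for key, raw, quoted in _PAIR.findall(line):
        # Use the text inside the quotes when the value was quoted.
        fields[key] = quoted if raw.startswith('"') else raw
    return fields


# Sample line copied from this report.
SAMPLE = (
    'Log: apparmor="DENIED" operation="connect" '
    'profile="snap.dcgm.nv-hostengine" name="/run/nvidia-persistenced/socket" '
    'pid=66001 comm="nv-hostengine" requested_mask="wr" denied_mask="wr" '
    'fsuid=0 ouid=114'
)

if __name__ == "__main__":
    denial = parse_denial(SAMPLE)
    print(denial["operation"], denial["name"], denial["denied_mask"])
```

Grouping the parsed results by (operation, name) makes it easier to see which paths a snap interface would need to allow.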
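From what I found about /dev/char/195:0 and /dev/char/195:1:
- Major number 195 is reserved for NVIDIA devices.
- Minor N maps to /dev/nvidiaN, so 195:0 is /dev/nvidia0 and 195:1 is /dev/nvidia1. The control device that all NVIDIA userspace components talk to, /dev/nvidiactl, uses minor 255.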
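The mapping from a /dev/char/MAJOR:MINOR entry to a device node can be checked with a short script. This is a sketch under assumptions: describe_char_link and device_numbers are hypothetical helpers, and the minor numbers used (255 for /dev/nvidiactl, 254 for /dev/nvidia-modeset, N for /dev/nvidiaN) are the conventional NVIDIA numbering as far as I know. It is worth confirming against the real nodes with device_numbers or ls -l /dev/nvidia*.

```python
import os
import stat

NVIDIA_MAJOR = 195
# Assumed conventional minors for the non-GPU nodes; verify on the target machine.
SPECIAL_MINORS = {255: "/dev/nvidiactl", 254: "/dev/nvidia-modeset"}


def describe_char_link(name):
    """Map a '/dev/char/MAJOR:MINOR' path to the expected NVIDIA node name.

    Returns None when the major number is not the NVIDIA one.
    """
    major, minor = (int(part) for part in name.rsplit("/", 1)[-1].split(":"))
    if major != NVIDIA_MAJOR:
        return None
    return SPECIAL_MINORS.get(minor, f"/dev/nvidia{minor}")


def device_numbers(path):
    """Return (major, minor) of an existing character device node."""
    st = os.stat(path)
    if not stat.S_ISCHR(st.st_mode):
        raise ValueError(f"{path} is not a character device")
    return os.major(st.st_rdev), os.minor(st.st_rdev)


if __name__ == "__main__":
    for link in ("/dev/char/195:0", "/dev/char/195:1"):
        print(link, "->", describe_char_link(link))
```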
Adding the nvidia-drivers-support interface to the snap solves this issue, but the interface's description says it:
is for internal Ubuntu Core use only
alfonsosanchezbeato said this:
I think the extra devices/socket should be added to the opengl interface, as we have done in the past (yes, it is a weird interface to use but there is where you put all GPU permissions). nvidia-drivers-support was meant as a way to be able to assemble nvidia kernel drivers.
Even after enabling persistence mode, the output of dcgmi diag -r 1 looks like this:
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 570.172.08                                     |
| GPU Device IDs Detected   | 26b5,26b5,26b5,26b5,26b5,26b5,26b5,26b5        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Info                      | Persistence mode for GPU 0 is disabled. Enabl  |
|                           | e persistence mode by running "nvidia-smi -i   |
|                           | <gpuId> -pm 1 " as root.,Persistence mode for  |
|                           |  GPU 1 is disabled. Enable persistence mode b  |
|                           | y running "nvidia-smi -i <gpuId> -pm 1 " as r  |
|                           | oot.,Persistence mode for GPU 2 is disabled.   |
|                           | Enable persistence mode by running "nvidia-sm  |
|                           | i -i <gpuId> -pm 1 " as root.,Persistence mod  |
|                           | e for GPU 3 is disabled. Enable persistence m  |
|                           | ode by running "nvidia-smi -i <gpuId> -pm 1 "  |
|                           |  as root.,Persistence mode for GPU 4 is disab  |
|                           | led. Enable pers                               |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+
Snap version
3.3.8+snap-45c9cd4857
Expected behaviour
After enabling persistence mode with nvidia-smi -i <gpuId> -pm 1, dcgmi diag -r 1 should no longer show an Info row saying that persistence mode needs to be enabled.
Something like:
Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 570.172.08                                     |
| GPU Device IDs Detected   | 26b5,26b5,26b5,26b5,26b5,26b5,26b5,26b5        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+
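To confirm independently of DCGM that persistence mode really is enabled on the host, the driver can be queried directly. A minimal sketch, assuming nvidia-smi is on the PATH and that its CSV output reports the mode as "Enabled" or "Disabled"; parse_persistence and query_persistence are hypothetical helper names.

```python
import subprocess


def parse_persistence(csv_text):
    """Parse 'index, persistence_mode' CSV rows into {gpu_index: enabled}."""
    modes = {}
    for row in csv_text.strip().splitlines():
        index, mode = (cell.strip() for cell in row.split(","))
        modes[int(index)] = mode == "Enabled"
    return modes


def query_persistence():
    """Ask nvidia-smi for the persistence mode of every GPU on the host."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_persistence(result.stdout)


if __name__ == "__main__":
    disabled = [gpu for gpu, enabled in query_persistence().items() if not enabled]
    print("GPUs with persistence mode disabled:", disabled or "none")
```

If this reports every GPU as enabled while dcgmi diag -r 1 still prints the Info row, that would point at the confined nv-hostengine being unable to see the state, rather than at the host configuration.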
Reproduce / Test
Install dcgm on a machine with NVIDIA GPUs, install a driver version >= 525 and < 580 (CUDA 12 compatible), and run dcgmi diag -r 1.
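Before reproducing, the driver version constraint above can be checked with a small helper. This is only a sketch: driver_in_range is a hypothetical function that compares the major component of a version string such as the 570.172.08 shown in the diagnostic output.

```python
def driver_in_range(version, low=525, high=580):
    """Return True when the driver's major version satisfies low <= major < high."""
    major = int(version.split(".")[0])
    return low <= major < high


if __name__ == "__main__":
    # Driver version taken from the diagnostic output in this report.
    print(driver_in_range("570.172.08"))
```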
Notes & References
No response