Dcgmi does not have enough permissions to run diag #68

@gabrielcocenza

Description

When running dcgmi diag -r 1 with snappy-debug, the following issue appears:

INFO: Following '/var/log/syslog'. If have dropped messages, use:
INFO: $ sudo journalctl --output=short --follow --all | sudo snappy-debug
= AppArmor =
Time: Sep  9 21:43:32
Log: apparmor="DENIED" operation="connect" profile="snap.dcgm.nv-hostengine" name="/run/nvidia-persistenced/socket" pid=66001 comm="nv-hostengine" requested_mask="wr" denied_mask="wr" fsuid=0 ouid=114
File: /run/nvidia-persistenced/socket (write)
Suggestions:
* adjust program to use $SNAP_DATA
* adjust program to use /run/shm/snap.$SNAP_NAME.*
* adjust program to use /run/snap.$SNAP_NAME.*
* adjust snap to use snap layouts (https://forum.snapcraft.io/t/snap-layouts/7207)

Currently no snap interface grants access to this socket.

The following denials were also detected:

= AppArmor =
Time: Sep 15 19:23:56
Log: apparmor="DENIED" operation="unlink" profile="snap.dcgm.nv-hostengine" name="/dev/char/195:0" pid=163107 comm="nv-hostengine" requested_mask="d" denied_mask="d" fsuid=0 ouid=0
File: /dev/char/195:0 (write)

= AppArmor =
Time: Sep 15 19:23:56
Log: apparmor="DENIED" operation="unlink" profile="snap.dcgm.nv-hostengine" name="/dev/char/195:1" pid=163107 comm="nv-hostengine" requested_mask="d" denied_mask="d" fsuid=0 ouid=0
File: /dev/char/195:1 (write)
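
To compare denials like the ones above, it can help to split each AppArmor log line into its key=value fields. The following is a minimal sketch, not part of snappy-debug; `parse_denial` and `summarize` are hypothetical helper names, and the parser assumes values are either double-quoted or free of whitespace, as in the lines shown here.

```python
import re

# Matches key="quoted value" or key=bare_value pairs in an audit line.
_PAIR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_denial(log_line: str) -> dict[str, str]:
    """Split an AppArmor audit line into a dict of its key=value fields."""
    fields = {}
    for key, quoted, bare in _PAIR.findall(log_line):
        fields[key] = quoted if quoted else bare
    return fields


def summarize(log_line: str) -> str:
    """Return a one-line summary: profile, operation, path and denied mask."""
    f = parse_denial(log_line)
    return (
        f"{f.get('profile', '?')}: {f.get('operation', '?')} "
        f"{f.get('name', '?')} (denied_mask={f.get('denied_mask', '?')})"
    )
```

Passing the first denial from this report to `summarize` yields the profile `snap.dcgm.nv-hostengine`, the operation `connect`, and the socket path, which makes it easier to see that the socket denial and the `/dev/char` denials are separate permission gaps.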

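From what I investigated about /dev/char/195:1:

  • Major number 195 is reserved for NVIDIA devices.
  • Minor 1 → /dev/nvidia1, the device node of the second GPU (minor 0 is /dev/nvidia0). The control device /dev/nvidiactl, which all NVIDIA userspace components talk to, is minor 255.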
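
To check on the affected host which device node a denied /dev/char/MAJOR:MINOR path refers to, the numbers can be matched against the character devices under /dev. This is a rough sketch using only the Python standard library; `parse_char_path` and `find_nodes` are hypothetical helper names, and the scan only looks at the top level of the given directory.

```python
import os
import re
import stat


def parse_char_path(path: str) -> tuple[int, int]:
    """Extract (major, minor) from a path such as /dev/char/195:1."""
    m = re.fullmatch(r"/dev/char/(\d+):(\d+)", path)
    if m is None:
        raise ValueError(f"not a /dev/char/MAJOR:MINOR path: {path!r}")
    return int(m.group(1)), int(m.group(2))


def find_nodes(major: int, minor: int, dev_dir: str = "/dev") -> list[str]:
    """List character device nodes in dev_dir with the given major/minor."""
    matches = []
    for entry in sorted(os.listdir(dev_dir)):
        full = os.path.join(dev_dir, entry)
        try:
            st = os.lstat(full)  # lstat: do not follow symlinks
        except OSError:
            continue
        if not stat.S_ISCHR(st.st_mode):
            continue
        if (os.major(st.st_rdev), os.minor(st.st_rdev)) == (major, minor):
            matches.append(full)
    return matches
```

For example, `find_nodes(*parse_char_path("/dev/char/195:1"))` run on the GPU machine lists the /dev entries that share those numbers.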

Adding the nvidia-drivers-support interface to the snap solves this issue, but the description of the interface says that it:

is for internal Ubuntu Core use only

alfonsosanchezbeato said this:

I think the extra devices/socket should be added to the opengl interface, as we have done in the past (yes, it is a weird interface to use but there is where you put all GPU permissions). nvidia-drivers-support was meant as a way to be able to assemble nvidia kernel drivers.

Even after enabling persistence mode, the output of dcgmi diag -r 1 looks like this:

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 570.172.08                                     |
| GPU Device IDs Detected   | 26b5,26b5,26b5,26b5,26b5,26b5,26b5,26b5        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Info                      | Persistence mode for GPU 0 is disabled. Enabl  |
|                           | e persistence mode by running "nvidia-smi -i   |
|                           | <gpuId> -pm 1 " as root.,Persistence mode for  |
|                           |  GPU 1 is disabled. Enable persistence mode b  |
|                           | y running "nvidia-smi -i <gpuId> -pm 1 " as r  |
|                           | oot.,Persistence mode for GPU 2 is disabled.   |
|                           | Enable persistence mode by running "nvidia-sm  |
|                           | i -i <gpuId> -pm 1 " as root.,Persistence mod  |
|                           | e for GPU 3 is disabled. Enable persistence m  |
|                           | ode by running "nvidia-smi -i <gpuId> -pm 1 "  |
|                           |  as root.,Persistence mode for GPU 4 is disab  |
|                           | led. Enable pers                               |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+
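
Because the Info message is wrapped mid-word across table rows, it is awkward to grep for the affected GPUs. The sketch below collects the Info cells, removes all whitespace, and extracts the GPU indices reported as disabled. It is only an illustration based on the table layout shown above; `disabled_persistence_gpus` is a hypothetical helper, and the message wording is assumed to match this DCGM 3.3.8 output.

```python
import re


def disabled_persistence_gpus(diag_output: str) -> list[int]:
    """Return GPU indices the Info row reports as 'persistence mode disabled'."""
    cells = []
    in_info = False
    for raw in diag_output.splitlines():
        line = raw.strip()
        if not (line.startswith("|") and line.endswith("|")):
            in_info = False  # borders and non-table lines end the Info block
            continue
        parts = line[1:-1].split("|", 1)
        if len(parts) != 2:
            in_info = False  # section rows such as "|-----  Metadata ..."
            continue
        label, value = parts[0].strip(), parts[1]
        if label == "Info":
            in_info = True
        elif label:
            in_info = False  # a new labelled row ends the Info block
        if in_info:
            cells.append(value)
    # Wrapping may split words anywhere, so compare with whitespace removed.
    squashed = re.sub(r"\s+", "", "".join(cells))
    found = re.findall(r"PersistencemodeforGPU(\d+)isdisabled", squashed)
    return sorted({int(index) for index in found})
```

Applied to the output above, this would report GPUs 0 through 4; the table truncates the message before the remaining GPUs are listed.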

Snap version

3.3.8+snap-45c9cd4857

Expected behaviour

After enabling persistence mode for each GPU with nvidia-smi -i <gpuId> -pm 1, running dcgmi diag -r 1 should no longer show an Info row asking to enable it.

Something like:

Successfully ran diagnostic for group.
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Metadata  ----------+------------------------------------------------|
| DCGM Version              | 3.3.8                                          |
| Driver Version Detected   | 570.172.08                                     |
| GPU Device IDs Detected   | 26b5,26b5,26b5,26b5,26b5,26b5,26b5,26b5        |
|-----  Deployment  --------+------------------------------------------------|
| Denylist                  | Pass                                           |
| NVML Library              | Pass                                           |
| CUDA Main Library         | Pass                                           |
| Permissions and OS Blocks | Pass                                           |
| Persistence Mode          | Pass                                           |
| Environment Variables     | Pass                                           |
| Page Retirement/Row Remap | Pass                                           |
| Graphics Processes        | Pass                                           |
| Inforom                   | Pass                                           |
+---------------------------+------------------------------------------------+
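
To confirm the driver-side state independently of dcgmi, the persistence mode can be read from nvidia-smi on the host and compared with what the diagnostic reports. This is a small sketch that assumes nvidia-smi is on PATH and prints lines such as "0, Enabled" for this query; `parse_persistence` and `query_persistence` are hypothetical helper names.

```python
import subprocess


def parse_persistence(csv_text: str) -> dict[int, bool]:
    """Map GPU index to True when its persistence mode reads 'Enabled'."""
    state = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, mode = (part.strip() for part in line.split(",", 1))
        state[int(index)] = mode == "Enabled"
    return state


def query_persistence() -> dict[int, bool]:
    """Ask nvidia-smi for the persistence mode of every GPU (host side)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,persistence_mode",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_persistence(result.stdout)
```

If `query_persistence()` returns True for every GPU while dcgmi diag still prints the Info row, that points at the confined nv-hostengine being unable to see the real state, which is consistent with the AppArmor denials above.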

Reproduce / Test

Install dcgm on a machine with NVIDIA GPUs, install a driver version >= 525 and < 580 (CUDA 12 compatible), and run dcgmi diag -r 1.
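
The driver range above can be checked before attempting the reproduction. A minimal sketch, assuming only the leading branch number of the driver version string matters for the ">= 525 and < 580" condition; `driver_in_repro_range` is a hypothetical helper name.

```python
def driver_in_repro_range(version: str, low: int = 525, high: int = 580) -> bool:
    """Return True when the driver branch number satisfies low <= branch < high."""
    branch = int(version.split(".", 1)[0])  # "570.172.08" -> 570
    return low <= branch < high
```

For the driver reported in this issue, `driver_in_repro_range("570.172.08")` evaluates to True.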

Notes & References

No response
