MIG partitioning leading to nvidia_a100_3g.39gb instead of 3g.40gb partition for NVIDIA driver versions 535.x and 545.x #31

@berhane

Hi,

I have a number of servers with 4 A100 GPUs each. I've MIG-partitioned each GPU with the 'all-balanced' profile and manage them through Slurm.

$ cat /etc/nvidia-mig-manager/config.yaml
...
  all-balanced:
...
    # H100-80GB, H800-80GB, A100-80GB, A800-80GB
    - device-filter: ["0x233110DE", "0x233010DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"]
      devices: all
      mig-enabled: true
      mig-devices:
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1
...
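
For reference, the profile above is meant to use the whole GPU. Here is a minimal sketch, not part of nvidia-mig-manager, that sums the compute slices and nominal memory from label lines of the form shown in the config. The function name `sum_mig_profile` is made up for illustration, and it assumes that the leading number in each label is the compute-slice count and that the `gb` figure is the nominal memory size.

```shell
#!/usr/bin/env bash
# Sum compute slices and nominal memory for lines like:  "3g.40gb": 1
# ASSUMPTION: "<N>g.<M>gb" means N compute slices and M GB nominal memory.
sum_mig_profile() {
    # Extract "<slices> <gb> <count>" from each matching line, then total them.
    sed -nE 's/^[[:space:]]*"([0-9]+)g\.([0-9]+)gb":[[:space:]]*([0-9]+).*/\1 \2 \3/p' |
        awk '{ slices += $1 * $3; gb += $2 * $3 }
             END { printf "%d compute slices, %d GB\n", slices, gb }'
}

# The three mig-devices entries from the all-balanced profile above.
sum_mig_profile <<'EOF'
        "1g.10gb": 2
        "2g.20gb": 1
        "3g.40gb": 1
EOF
```

For the three entries above this prints `7 compute slices, 80 GB`.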

With NVIDIA driver 495.x, I could partition them as follows without any issues.

  • 1g.10gb : 2
  • 2g.20gb : 1
  • 3g.40gb : 1

However, with the latest drivers, namely 535.x and 545.x, each GPU gets partitioned into

  • 1g.10gb : 2
  • 2g.20gb : 1
  • 3g.39gb : 1

I use AutoDetect=nvml for Slurm to detect the types of MIG partitions and their CPU affinities. Slurm reports this discrepancy in the logs:

$ slurmd -G 

slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error:     GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null)  Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: gres/gpu: _merge_system_gres_conf: WARNING: The following autodetected GPUs are being ignored:
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):24-31  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 UniqueId:MIG-efa8d929-9af6-5083-af99-f1ceefb8b29a
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):8-15  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 UniqueId:MIG-44b932cc-40b5-5e7b-b01b-7e342ecfcb64
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):56-63  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 UniqueId:MIG-e3ab25b5-7be9-5d4b-940d-63841fead660
slurmd:     GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):40-47  Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 UniqueId:MIG-60b3faac-01f1-5bc1-be5c-c53e2c4e0d82
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=31 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31 Cores=24-31 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=166 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 Cores=8-15 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=301 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap300,/dev/nvidia-caps/nvidia-cap301 Cores=56-63 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=436 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap435,/dev/nvidia-caps/nvidia-cap436 Cores=40-47 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=85 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 Cores=24-31 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=220 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 Cores=8-15 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=355 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap354,/dev/nvidia-caps/nvidia-cap355 Cores=56-63 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=490 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap489,/dev/nvidia-caps/nvidia-cap490 Cores=40-47 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=94 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 Cores=24-31 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=229 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap228,/dev/nvidia-caps/nvidia-cap229 Cores=8-15 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=364 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap363,/dev/nvidia-caps/nvidia-cap364 Cores=56-63 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=499 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap498,/dev/nvidia-caps/nvidia-cap499 Cores=40-47 CoreCnt=64 Links=0,0,0,-1
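
To make the mismatch easier to spot, the GRES types in this output can be tallied. This is a minimal sketch that relies only on the `Type:` and `Type=` fields visible in the log above; the function name `summarize_gres_types` is a placeholder of my own, not a Slurm tool.

```shell
#!/usr/bin/env bash
# Tally GRES types from `slurmd -G` output read on stdin.
# Matches both "Type:<name>" (error/warning lines) and "Type=<name>"
# (accepted lines). "HAS_TYPE" in the Flags field is not matched because
# the pattern is case-sensitive.
summarize_gres_types() {
    grep -oE 'Type[:=][^ ]+' |
        sed -E 's/^Type[:=]//' |
        sort | uniq -c |
        awk '{ print $2, $1 }'   # print "<type> <count>"
}

# Example usage:
#   slurmd -G 2>&1 | summarize_gres_types
```

Note that this does not distinguish discarded, ignored, and accepted entries; it only counts how often each type name appears, which is enough to see `3g.40gb` and `nvidia_a100_3g.39gb` side by side.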

I have tried nvidia-mig-manager versions 0.5.3, 0.5.4.1, and 0.5.5, and I see the same behavior whenever the NVIDIA driver version is 535.x or 545.x. I haven't tried 505, 515, or 525.

=== w/ NVIDIA driver 495.x ===

$ nvidia-smi
Wed Dec 13 00:32:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  Off  | 00000000:01:00.0 Off |                   On |
| N/A   25C    P0    50W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  Off  | 00000000:41:00.0 Off |                   On |
| N/A   25C    P0    51W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  Off  | 00000000:81:00.0 Off |                   On |
| N/A   25C    P0    48W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  Off  | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0    50W / 500W |     24MiB / 81251MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+


+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|                  |      0MiB / 65535MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    3   0   1  |      6MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   10   0   3  |      3MiB /  9728MiB | 14      0 |  1   0    0    0    0 |
|                  |      0MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

=== w/ NVIDIA driver 545.x ===

$  nvidia-smi
Wed Dec 13 00:29:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:01:00.0 Off |                   On |
| N/A   26C    P0              51W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM4-80GB          On  | 00000000:41:00.0 Off |                   On |
| N/A   26C    P0              49W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM4-80GB          On  | 00000000:81:00.0 Off |                   On |
| N/A   27C    P0              51W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM4-80GB          On  | 00000000:C1:00.0 Off |                   On |
| N/A   25C    P0              48W / 500W |     87MiB / 81920MiB |     N/A      Default |
|                                         |                      |              Enabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| MIG devices:                                                                          |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|                  |               0MiB / 65535MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    3   0   1  |              25MiB / 19968MiB  | 28      0 |  2   0    1    0    0 |
|                  |               0MiB / 32767MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3    9   0   2  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  3   10   0   3  |              12MiB /  9728MiB  | 14      0 |  1   0    0    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Looking at the memory of the different partitions, the 10GB and 20GB partitions are the same size regardless of the NVIDIA driver version, but the "40GB" partitions are slightly smaller with driver version 545.x (40192MiB) than with driver version 495.x (40448MiB):

=== w/ NVIDIA driver 495.x ===

$ nvidia-smi | grep " 2   0   0"
|  0    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  1    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  2    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |
|  3    2   0   0  |     10MiB / 40448MiB | 42      0 |  3   0    2    0    0 |

=== w/ NVIDIA driver 545.x ===

$ nvidia-smi | grep " 2   0   0"
|  0    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  1    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  2    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |
|  3    2   0   0  |              37MiB / 40192MiB  | 42      0 |  3   0    2    0    0 |

This is perhaps why the partition is reported as 3g.39gb instead of 3g.40gb. We already have lots of GPUs with 3g.40gb partitions and people are trained to use them, so working around this by introducing a different label for the same Slurm GRES would create a lot of confusion and inconvenience. We would appreciate any guidance in resolving this issue.
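
As a rough check of that guess, here is a small sketch that converts the MiB totals from the nvidia-smi output to whole GiB, rounding to the nearest integer. I have not confirmed that the label is actually derived this way; the rounding rule and the function name `mib_to_nearest_gib` are assumptions for illustration only.

```shell
#!/usr/bin/env bash
# Convert a MiB figure to whole GiB, rounding to the nearest integer.
# ASSUMPTION: the "<N>gb" part of the profile label comes from this kind
# of rounding; this is a hypothesis, not confirmed behavior.
mib_to_nearest_gib() {
    local mib=$1
    echo $(( (mib + 512) / 1024 ))
}

# Memory totals taken from the nvidia-smi output above.
for mib in 9728 19968 40448 40192; do
    printf '%6d MiB -> %dgb\n' "$mib" "$(mib_to_nearest_gib "$mib")"
done
```

Under this assumption, 9728MiB and 19968MiB map to 10gb and 20gb, 40448MiB (39.5 GiB) maps to 40gb, and 40192MiB (39.25 GiB) maps to 39gb, which matches the labels seen with each driver.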

Thanks a lot.
