Description
Hi,
I have a bunch of servers with four A100 GPUs each. I have MIG-partitioned each GPU using the 'all-balanced' profile, and I manage the partitions through Slurm.
$ cat /etc/nvidia-mig-manager/config.yaml
...
all-balanced:
...
# H100-80GB, H800-80GB, A100-80GB, A800-80GB
- device-filter: ["0x233110DE", "0x233010DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 2
"2g.20gb": 1
"3g.40gb": 1
...
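For reference, the profile is applied through nvidia-mig-parted (which nvidia-mig-manager wraps); the invocation below is a sketch based on the project README, using the config path from above:
$ nvidia-mig-parted apply -f /etc/nvidia-mig-manager/config.yaml -c all-balanced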
With NVIDIA driver 495.x, I could partition them as follows without any issues:
- 1g.10gb : 2
- 2g.20gb : 1
- 3g.40gb : 1
However, with the latest drivers, namely 535.x and 545.x, each GPU gets partitioned into:
- 1g.10gb : 2
- 2g.20gb : 1
- 3g.39gb : 1
I use AutoDetect=nvml for Slurm to detect the types of MIG partitions and their CPU affinities. Slurm reports this discrepancy in the logs:
$ slurmd -G
slurmd: gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: error: Discarding the following config-only GPU due to lack of File specification:
slurmd: error: GRES[gpu] Type:3g.40gb Count:1 Cores(64):(null) Links:(null) Flags:HAS_TYPE,ENV_NVML,ENV_RSMI,ENV_ONEAPI,ENV_OPENCL,ENV_DEFAULT File:(null) UniqueId:(null)
slurmd: gres/gpu: _merge_system_gres_conf: WARNING: The following autodetected GPUs are being ignored:
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):24-31 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia0,/dev/nvidia-caps/nvidia-cap21,/dev/nvidia-caps/nvidia-cap22 UniqueId:MIG-efa8d929-9af6-5083-af99-f1ceefb8b29a
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):8-15 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia1,/dev/nvidia-caps/nvidia-cap156,/dev/nvidia-caps/nvidia-cap157 UniqueId:MIG-44b932cc-40b5-5e7b-b01b-7e342ecfcb64
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):56-63 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia2,/dev/nvidia-caps/nvidia-cap291,/dev/nvidia-caps/nvidia-cap292 UniqueId:MIG-e3ab25b5-7be9-5d4b-940d-63841fead660
slurmd: GRES[gpu] Type:nvidia_a100_3g.39gb Count:1 Cores(64):40-47 Links:-1,0,0,0 Flags:HAS_FILE,HAS_TYPE,ENV_NVML File:/dev/nvidia3,/dev/nvidia-caps/nvidia-cap426,/dev/nvidia-caps/nvidia-cap427 UniqueId:MIG-60b3faac-01f1-5bc1-be5c-c53e2c4e0d82
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=31 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap30,/dev/nvidia-caps/nvidia-cap31 Cores=24-31 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=166 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap165,/dev/nvidia-caps/nvidia-cap166 Cores=8-15 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=301 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap300,/dev/nvidia-caps/nvidia-cap301 Cores=56-63 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=2g.20gb Count=1 Index=436 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap435,/dev/nvidia-caps/nvidia-cap436 Cores=40-47 CoreCnt=64 Links=0,-1,0,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=85 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap84,/dev/nvidia-caps/nvidia-cap85 Cores=24-31 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=220 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap219,/dev/nvidia-caps/nvidia-cap220 Cores=8-15 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=355 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap354,/dev/nvidia-caps/nvidia-cap355 Cores=56-63 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=490 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap489,/dev/nvidia-caps/nvidia-cap490 Cores=40-47 CoreCnt=64 Links=0,0,-1,0
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=94 ID=7696487 File=/dev/nvidia0,/dev/nvidia-caps/nvidia-cap93,/dev/nvidia-caps/nvidia-cap94 Cores=24-31 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=229 ID=7696487 File=/dev/nvidia1,/dev/nvidia-caps/nvidia-cap228,/dev/nvidia-caps/nvidia-cap229 Cores=8-15 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=364 ID=7696487 File=/dev/nvidia2,/dev/nvidia-caps/nvidia-cap363,/dev/nvidia-caps/nvidia-cap364 Cores=56-63 CoreCnt=64 Links=0,0,0,-1
slurmd: Gres Name=gpu Type=1g.10gb Count=1 Index=499 ID=7696487 File=/dev/nvidia3,/dev/nvidia-caps/nvidia-cap498,/dev/nvidia-caps/nvidia-cap499 Cores=40-47 CoreCnt=64 Links=0,0,0,-1
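For context, the relevant Slurm configuration is along these lines (a sketch; the node name is a placeholder, and the GRES counts follow from four GPUs per node):
# gres.conf
AutoDetect=nvml
# slurm.conf
NodeName=gpu[01-04] Gres=gpu:1g.10gb:8,gpu:2g.20gb:4,gpu:3g.40gb:4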
I have tried nvidia-mig-manager versions 0.5.3, 0.5.4.1, and 0.5.5, and I see the same behavior whenever the NVIDIA driver version is 535 or 545. I haven't tried 505, 515, or 525.
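For anyone reproducing this, the applied MIG state can also be double-checked directly (assuming nvidia-mig-parted's assert subcommand behaves as documented in the project README):
$ nvidia-mig-parted assert -f /etc/nvidia-mig-manager/config.yaml -c all-balanced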
=== w/ NVIDIA driver 495.x ===
$ nvidia-smi
Wed Dec 13 00:32:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05 Driver Version: 495.29.05 CUDA Version: 11.5 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-SXM... Off | 00000000:01:00.0 Off | On |
| N/A 25C P0 50W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM... Off | 00000000:41:00.0 Off | On |
| N/A 25C P0 51W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM... Off | 00000000:81:00.0 Off | On |
| N/A 25C P0 48W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM... Off | 00000000:C1:00.0 Off | On |
| N/A 25C P0 50W / 500W | 24MiB / 81251MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 1 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 2 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 3 0 1 | 6MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 9 0 2 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 3 10 0 3 | 3MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
=== w/ NVIDIA driver 545.x ===
$ nvidia-smi
Wed Dec 13 00:29:41 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08 Driver Version: 545.23.08 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-SXM4-80GB On | 00000000:01:00.0 Off | On |
| N/A 26C P0 51W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-SXM4-80GB On | 00000000:41:00.0 Off | On |
| N/A 26C P0 49W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-SXM4-80GB On | 00000000:81:00.0 Off | On |
| N/A 27C P0 51W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-SXM4-80GB On | 00000000:C1:00.0 Off | On |
| N/A 25C P0 48W / 500W | 87MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+--------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+================================+===========+=======================|
| 0 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 0 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 1 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 2 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| | 0MiB / 65535MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 3 0 1 | 25MiB / 19968MiB | 28 0 | 2 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 9 0 2 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
| 3 10 0 3 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
Comparing the memory of the different partitions, the 10GB and 20GB partitions report the same size regardless of the NVIDIA driver version, but the "40GB" partitions are slightly smaller under driver 545.x (40192 MiB) than under 495.x (40448 MiB):
=== w/ NVIDIA driver 495.x ===
$ nvidia-smi | grep " 2 0 0"
| 0 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| 1 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| 2 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
| 3 2 0 0 | 10MiB / 40448MiB | 42 0 | 3 0 2 0 0 |
=== w/ NVIDIA driver 545.x ===
$ nvidia-smi | grep " 2 0 0"
| 0 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| 1 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| 2 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
| 3 2 0 0 | 37MiB / 40192MiB | 42 0 | 3 0 2 0 0 |
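One plausible explanation, assuming the driver derives the profile suffix by rounding the instance's framebuffer size to the nearest GiB (an assumption on my part, not confirmed from driver sources):
$ echo "scale=2; 40448/1024" | bc   # 39.50 -> rounds up to "40gb" under 495.x
$ echo "scale=2; 40192/1024" | bc   # 39.25 -> rounds down to "39gb" under 535.x/545.x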
This is presumably why the partition is reported as 3g.39gb instead of 3g.40gb. Since we already have lots of GPUs with 3g.40gb partitions and users are trained to request them, hacking around this by introducing a different label for the same Slurm GRES would create a lot of confusion and inconvenience. We would appreciate any guidance on resolving this issue.
Thanks a lot.