Bug description
Problem
Despite passing values other than 0 as `devices` to `Trainer`, memory was allocated on device 0.
Passing a list of specific devices to pytorch-lightning should ideally mean that other devices are not used in any way.
Investigation
`CUDAAccelerator.setup_device(device)` calls `_check_cuda_matmul_precision` before `torch.cuda.set_device(device)`.
`_check_cuda_matmul_precision` reaches `torch.cuda._lazy_init` through the following call chain:
_check_cuda_matmul_precision
_is_ampere_or_later
torch.cuda.get_device_capability
torch.cuda.get_device_properties
torch.cuda._lazy_init
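One way to observe this from a fresh Python process is sketched below. It assumes that `torch.cuda.is_initialized()` reflects whether lazy initialization has run. The helper name `capability_query_initializes_cuda` is hypothetical, and it returns `None` whenever the check cannot be made (PyTorch missing, no usable CUDA device, or CUDA already initialized).

```python
def capability_query_initializes_cuda(device_index):
    """Report whether querying a device's capability initializes CUDA.

    Hypothetical helper, not part of PyTorch or Lightning.
    Returns True/False when the effect can be observed, else None.
    """
    try:
        import torch
    except ImportError:
        return None  # PyTorch is not installed
    if not torch.cuda.is_available():
        return None  # no CUDA runtime or no GPU
    if device_index >= torch.cuda.device_count():
        return None  # requested index does not exist on this machine
    if torch.cuda.is_initialized():
        return None  # already initialized, so nothing can be observed
    # Same call that _is_ampere_or_later ends up making in the chain above.
    torch.cuda.get_device_capability(device_index)
    return torch.cuda.is_initialized()


if __name__ == "__main__":
    print(capability_query_initializes_cuda(3))
```

On a multi-GPU machine, a `True` result in a fresh interpreter would be consistent with the call chain above. It does not by itself show which device received the allocation.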
Is this a bug?
It was not clear to me what the contract is regarding using devices not explicitly passed, but also not excluded via CUDA_VISIBLE_DEVICES.
A similar issue in pytorch convinced me that this is a bug: pytorch/pytorch#149119
I also considered whether this was a pytorch-lightning bug or an upstream pytorch bug. I read torch.cuda code and it seems to me that it is up to the caller to first call torch.cuda.set_device before other functions that require CUDA context.
How to fix it
It seems to me that it's just a matter of reordering two lines so that `torch.cuda.set_device(device)` runs before `_check_cuda_matmul_precision`. The reordering seems harmless: `_check_cuda_matmul_precision` only logs info for some devices.
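Why the order matters can be illustrated with a toy model. This is a sketch only: `ToyCuda` is a hypothetical stand-in, not PyTorch code, and it assumes that lazy initialization creates a context on whichever device is current at that moment, with device 0 as the default.

```python
class ToyCuda:
    """Toy stand-in for the CUDA runtime (an assumption, not real PyTorch)."""

    def __init__(self):
        self.current_device = 0  # assumed default current device
        self.initialized = False
        self.devices_with_context = set()

    def set_device(self, index):
        self.current_device = index

    def lazy_init(self):
        # Assumed behavior: first init creates a context on the current device.
        if not self.initialized:
            self.initialized = True
            self.devices_with_context.add(self.current_device)

    def get_device_capability(self, index):
        self.lazy_init()
        return (0, 0)  # placeholder value, irrelevant to the ordering

    def allocate(self):
        # Any later allocation uses the current device.
        self.lazy_init()
        self.devices_with_context.add(self.current_device)


def setup_device_current(cuda, index):
    cuda.get_device_capability(index)  # precision check runs first
    cuda.set_device(index)


def setup_device_reordered(cuda, index):
    cuda.set_device(index)  # select the device first
    cuda.get_device_capability(index)


if __name__ == "__main__":
    for setup in (setup_device_current, setup_device_reordered):
        cuda = ToyCuda()
        setup(cuda, 3)
        cuda.allocate()
        print(setup.__name__, sorted(cuda.devices_with_context))
```

In this model the current order leaves a context on device 0 as well as device 3, while the reordered version touches only device 3.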
I will submit a PR for this shortly.
What version are you seeing the problem on?
v2.6, master
Reproduced in studio
No response
How to reproduce the bug
Train with a trainer like:
import pytorch_lightning

# Request only GPUs 3-6; device 0 is deliberately not in the list.
trainer = pytorch_lightning.Trainer(
    devices=[3, 4, 5, 6],
    accelerator="gpu",
    strategy="ddp_find_unused_parameters_true",
)
and check memory allocation via `nvidia-smi`.
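The `nvidia-smi` check can also be scripted. The sketch below assumes the `--query-gpu=index,memory.used --format=csv,noheader,nounits` query, which reports used memory in MiB per GPU. The function names and the 50 MiB threshold are illustrative choices, not part of the original report.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text):
    """Map GPU index -> used memory in MiB from the CSV query output."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def unexpected_devices(usage, requested, threshold_mib=50):
    """Return GPUs outside `requested` that use more than `threshold_mib`.

    The threshold is an arbitrary allowance for baseline usage on idle GPUs.
    """
    return sorted(
        index
        for index, used in usage.items()
        if index not in requested and used > threshold_mib
    )


def query_memory_used():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)
```

While training is running, `unexpected_devices(query_memory_used(), {3, 4, 5, 6})` would list any GPU outside the requested set that holds memory above the threshold. Other processes on a shared machine can also cause entries here.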
Error messages and logs
# Error messages and logs here please
Environment
Current environment
# - PyTorch Lightning Version 2.6.1
# - PyTorch Version 2.6.0
# - Python 3.10
# - Ubuntu
# - CUDA 12.4 /cuDNN 9
# 8 GPUs Nvidia A40
# uv installation
More info
No response
cc @ethanwharris