CUDAAccelerator.setup_device(device) initializes a different device (GPU 0) #21725

@bm371613

Bug description

Problem

Despite passing device indices other than 0 as devices to Trainer, memory is still allocated on device 0.

Passing a list of specific devices to pytorch-lightning should mean that no other devices are used in any way.

Investigation

CUDAAccelerator.setup_device(device) calls _check_cuda_matmul_precision before torch.cuda.set_device(device).

_check_cuda_matmul_precision calls torch.cuda._lazy_init via

  • _check_cuda_matmul_precision
  • _is_ampere_or_later
  • torch.cuda.get_device_capability
  • torch.cuda.get_device_properties
  • torch.cuda._lazy_init

Is this a bug?

It was not clear to me what the contract is regarding using devices not explicitly passed, but also not excluded via CUDA_VISIBLE_DEVICES.

A similar issue in PyTorch convinced me that this is a bug: pytorch/pytorch#149119

I also considered whether this is a pytorch-lightning bug or an upstream PyTorch bug. I read the torch.cuda code, and it seems to me that it is up to the caller to call torch.cuda.set_device before any other function that requires a CUDA context.

How to fix it

It seems to be just a matter of reordering two lines so that torch.cuda.set_device(device) is called before _check_cuda_matmul_precision. The reordering seems harmless: _check_cuda_matmul_precision only logs an informational message for some devices.

I will submit a PR for this shortly.
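
To make the ordering argument concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described in the investigation above, not real torch.cuda or Lightning code: `FakeCudaRuntime`, `setup_check_first`, `setup_set_first`, and `devices_touched` are hypothetical names, and the model assumes that lazy initialization creates a context on whichever device is current at that moment.

```python
class FakeCudaRuntime:
    """Toy stand-in for the CUDA runtime. NOT the real torch.cuda API."""

    def __init__(self):
        self.current_device = 0   # device 0 is current until set_device is called
        self.initialized = False
        self.contexts = set()     # devices on which a context (and memory) exists

    def set_device(self, index):
        self.current_device = index

    def lazy_init(self):
        # Assumption taken from the issue: the first initialization
        # creates a context on the device that is current right now.
        if not self.initialized:
            self.initialized = True
            self.contexts.add(self.current_device)

    def get_device_capability(self, index):
        self.lazy_init()
        return (8, 6)  # placeholder value, irrelevant to the ordering

    def allocate(self):
        # Stands in for the later training work on the current device.
        self.lazy_init()
        self.contexts.add(self.current_device)


def setup_check_first(runtime, index):
    """Order reported in the issue: capability check, then set_device."""
    runtime.get_device_capability(index)
    runtime.set_device(index)


def setup_set_first(runtime, index):
    """Proposed order: set_device, then capability check."""
    runtime.set_device(index)
    runtime.get_device_capability(index)


def devices_touched(setup, index):
    """Run one setup order followed by an allocation; list devices with a context."""
    runtime = FakeCudaRuntime()
    setup(runtime, index)
    runtime.allocate()
    return sorted(runtime.contexts)


print(devices_touched(setup_check_first, 3))  # [0, 3]
print(devices_touched(setup_set_first, 3))    # [3]
```

In this model the only difference between the two setups is the order of the two calls, which matches the claim that the fix is a pure reordering.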

What version are you seeing the problem on?

v2.6, master

How to reproduce the bug

Train with a trainer like this:


import pytorch_lightning

trainer = pytorch_lightning.Trainer(
    devices=[3, 4, 5, 6],  # GPU 0 is deliberately not requested
    accelerator="gpu",
    strategy="ddp_find_unused_parameters_true",
)


and check memory allocation via `nvidia-smi`
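
If you prefer to check this programmatically, something like the following sketch could work. It assumes `nvidia-smi` is on the PATH and uses its CSV query mode; the helper names, the 50 MiB threshold, and the sample text are all made up for illustration and are not measurements from this report.

```python
import subprocess

# nvidia-smi CSV query: one line per GPU, "index, memory used in MiB".
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_memory_used(csv_text):
    """Map GPU index -> used memory in MiB from the CSV text."""
    usage = {}
    for line in csv_text.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        usage[int(index)] = int(used)
    return usage


def unexpected_devices(usage, requested, threshold_mib=50):
    """GPUs that were not requested but hold more than threshold_mib (arbitrary cutoff)."""
    return sorted(
        index
        for index, used in usage.items()
        if index not in requested and used > threshold_mib
    )


def query_memory_used():
    """Run nvidia-smi and parse its output (needs an NVIDIA driver installed)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_used(result.stdout)


# Made-up sample output, only to show the parsing.
sample = "0, 305\n1, 4\n2, 4\n3, 21000\n4, 20950\n"
print(unexpected_devices(parse_memory_used(sample), requested={3, 4, 5, 6}))  # [0]
```

On a real machine, call `query_memory_used()` while training is running and pass the result to `unexpected_devices` together with the device list given to the Trainer.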

Environment

Current environment
# - PyTorch Lightning Version 2.6.1
# - PyTorch Version 2.6.0
# - Python 3.10
# - Ubuntu
# - CUDA 12.4 / cuDNN 9
# - 8 GPUs: NVIDIA A40
# - Installed with uv

cc @ethanwharris
