Threaded loading puts tensors on wrong device when device="cuda" #42851

@kmod

Description

System Info

  • transformers version: 5.0.0rc1
  • Platform: Linux-6.14.0-36-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 1.2.1
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

Who can help?

When I upgraded to v5.0.0rc1, my single-node DDP setup (via accelerate) broke: during model loading, GPU 0 ends up with most of the tensors, causing an OOM. It looks like #41580 (@ArthurZucker) made threaded loading the default. The CUDA default device is thread-local, and I had been using device="cuda", which loads onto the default device; in threaded mode every worker thread's default device resets to 0, so all the weights land on GPU 0 and it OOMs. If I set device=accelerator.device or HF_DEACTIVATE_ASYNC_LOAD=1, the error goes away. On v4.57.3, setting HF_ENABLE_PARALLEL_LOADING=1 reproduces the same issue.
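For anyone unfamiliar with why threading changes the picture: the CUDA current device is per-thread state (cudaSetDevice, and therefore torch.cuda.set_device, only affects the calling thread), so a device set on the main thread does not propagate to loader threads. A plain-Python sketch of the same thread-local behavior (the names here are illustrative, not transformers code):

```python
import threading

# Plain-Python analog of CUDA's per-thread current device: cudaSetDevice
# (and therefore torch.cuda.set_device) only affects the calling thread.
_state = threading.local()

def set_device(idx):
    _state.device = idx

def current_device():
    # Threads that never called set_device fall back to device 0, as CUDA does.
    return getattr(_state, "device", 0)

set_device(3)            # main thread: pretend this rank should use GPU 3
print(current_device())  # 3 on the main thread

seen = []
t = threading.Thread(target=lambda: seen.append(current_device()))
t.start()
t.join()
print(seen[0])           # 0: the loader thread never saw the main thread's setting
```

This is why accelerate's per-rank set_device call is invisible to the loading threads, and everything defaults back to device 0.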

Currently (v4.57.3), setting device="cuda" works with accelerate because accelerate sets the default device during Accelerator construction. The accelerate docs recommend changing device="cuda" to device=accelerator.device but don't mention it as a requirement. I don't know if device_map="cuda" is now considered a user error when using multiple GPUs, but if so it would be nice to see that documented, or ideally caught. Perhaps the threaded loading code could assert that if the device_map is cuda, there isn't also a non-zero default device set, since in that case the tensor will end up in a different place depending on whether threaded loading is enabled.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Create a script with DP, i.e. dp_replicate_size > 1
  2. Load the model with device_map="cuda"
  3. Launch the script with either torchrun or accelerate launch
  4. Observe that GPU 0 uses much more memory than the other GPUs, or that it OOMs
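The workarounds mentioned above can be applied at launch time. A sketch (train.py is a placeholder for the DP script from step 1; the env var names come from the issue text):

```shell
# Reproduce on v5.0.0rc1 (threaded loading is the default):
accelerate launch train.py          # script loads with device_map="cuda"

# Workaround A: disable threaded/async loading
HF_DEACTIVATE_ASYNC_LOAD=1 accelerate launch train.py

# Workaround B: inside the script, pass device_map=accelerator.device
# instead of device_map="cuda"
```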

Expected behavior

  • No OOM, like in v4.57.3
