Threaded loading puts tensors on wrong device when device="cuda" #42851

@kmod

Description

System Info

  • transformers version: 5.0.0rc1
  • Platform: Linux-6.14.0-36-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Huggingface_hub version: 1.2.1
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition

Who can help?

When I upgraded to v5.0.0rc1, my single-node DDP setup (via accelerate) broke: during model loading, GPU 0 ends up with most of the tensors, causing an OOM. It looks like #41580 (@ArthurZucker) made threaded loading the default. The CUDA default device is thread-local, and I had been using device="cuda", which loads onto the default device; in threaded mode every worker thread's default device resets to 0, so all the weights land on GPU 0 and it OOMs. If I set device=accelerator.device or HF_DEACTIVATE_ASYNC_LOAD=1, the error goes away. On v4.57.3, setting HF_ENABLE_PARALLEL_LOADING=1 reproduces the same issue.
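For anyone unfamiliar with why threading changes the picture: the CUDA current device is per-thread state (cudaSetDevice, and therefore torch.cuda.set_device, only affects the calling thread), so a device set on the main thread does not propagate to loader threads. A plain-Python sketch of the same thread-local behavior (the names here are illustrative, not transformers code):

```python
import threading

# Plain-Python analog of CUDA's per-thread current device: cudaSetDevice
# (and therefore torch.cuda.set_device) only affects the calling thread.
_state = threading.local()

def set_device(idx):
    _state.device = idx

def current_device():
    # Threads that never called set_device fall back to device 0, as CUDA does.
    return getattr(_state, "device", 0)

set_device(3)            # main thread: pretend this rank should use GPU 3
print(current_device())  # 3 on the main thread

seen = []
t = threading.Thread(target=lambda: seen.append(current_device()))
t.start()
t.join()
print(seen[0])           # 0: the loader thread never saw the main thread's setting
```

This is why accelerate's per-rank set_device call is invisible to the loading threads, and everything defaults back to device 0.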

Currently (v4.57.3), setting device="cuda" works with accelerate because accelerate sets the default device during Accelerator construction. The accelerate docs recommend changing device="cuda" to device=accelerator.device but don't mention it as a requirement. I don't know if device_map="cuda" is now considered a user error when using multiple GPUs, but if so it would be nice to see that documented, or ideally caught. Perhaps the threaded loading code could assert that if the device_map is cuda, there isn't also a non-zero default device set, since in that case the tensor will end up in a different place depending on whether threaded loading is enabled.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Create a script with DP, i.e. dp_replicate_size > 1
  2. Load the model with device_map="cuda"
  3. Launch the script with either torchrun or accelerate launch
  4. Observe that GPU 0 uses much more memory than the other GPUs, or that it OOMs
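The workarounds mentioned above can be applied at launch time. A sketch (train.py is a placeholder for the DP script from step 1; the env var names come from the issue text):

```shell
# Reproduce on v5.0.0rc1 (threaded loading is the default):
accelerate launch train.py          # script loads with device_map="cuda"

# Workaround A: disable threaded/async loading
HF_DEACTIVATE_ASYNC_LOAD=1 accelerate launch train.py

# Workaround B: inside the script, pass device_map=accelerator.device
# instead of device_map="cuda"
```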

Expected behavior

  • No OOM, like in v4.57.3
