System Info
Depending on the file system, moving tensors loaded from a safetensors file to CUDA is 10x slower than first cloning the tensors and then moving them. This issue has heavy implications for other HF libraries (see huggingface/diffusers#12599).
In this case my file system (scratch) is BeeGFS, which is very common in HPC clusters.
Information
- The official example scripts
- My own modified scripts
Reproduction
First create a safetensors file on your filesystem.
import torch
from safetensors import safe_open
from safetensors.torch import save_file, load

weights = {}
for i in range(7):
    weights[f"weight.{i}"] = torch.randn((1024, 1024 + i))
save_file(weights, "scratch/model.safetensors")
Moving the tensors to CUDA after reading them is slow.
%%timeit
weights = {}
with safe_open("scratch/model.safetensors", framework="pt", device="cpu") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)
temp = [w.cuda() for w in weights.values()]
torch.cuda.synchronize()
# 903 ms ± 7.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Cloning the tensors and then moving them to CUDA is 10x faster.
%%timeit
weights = {}
with safe_open("scratch/model.safetensors", framework="pt", device="cpu") as f:
    for k in f.keys():
        weights[k] = f.get_tensor(k)
temp = [w.clone().cuda() for w in weights.values()]  # clone then move
torch.cuda.synchronize()
# 138 ms ± 1.27 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Loading all tensors into memory at once (no memory mapping) is faster than both of the above.
%%timeit
with open("scratch/model.safetensors", "rb") as f:
    weights = load(f.read())
temp = [w.cuda() for w in weights.values()]
torch.cuda.synchronize()
# 31.7 ms ± 460 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
These numbers suggest the memory mapping is the culprit: with safe_open the host-to-device copy is served by page faults that hit the network filesystem on demand, while .clone() (or reading the whole file up front) materializes the data in RAM with sequential reads first.
Expected behavior
Moving memory-mapped tensors to CUDA should not be slower than first cloning them and then moving them.
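In the meantime, a minimal workaround sketch based on the fastest variant above, wrapping the whole-file read in a helper (the function name load_safetensors_to_cuda is illustrative, not a safetensors API):

import torch
from safetensors.torch import load

def load_safetensors_to_cuda(path):
    # Read the whole file into RAM in one sequential pass (no memory mapping),
    # then deserialize and move each tensor to the GPU.
    with open(path, "rb") as f:
        weights = load(f.read())
    return {k: w.cuda() for k, w in weights.items()}

weights = load_safetensors_to_cuda("scratch/model.safetensors")

The trade-off is peak host memory: the entire file is held in RAM at once, which may matter for models much larger than the ~30 MB example here.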