
PytorchStreamReader failed reading zip archive: not a ZIP archive #20398

Open
@Crazy-LittleBoy


Bug description

After fine-tuning a 1.3B BERT model with Lightning + DeepSpeed ZeRO stage 2 on a single machine with 4 GPUs, converting the checkpoint with zero_to_fp32.py fails with the error below.
The conversion works fine when the model is switched to a 0.1B BERT model.
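
For context, here is a minimal sketch of a setup like the one described above; the model, dataloader, Trainer arguments, and checkpoint path are assumptions for illustration, not taken from the report:

```python
# Hypothetical reproduction sketch: Lightning + DeepSpeed ZeRO stage 2 on 4 GPUs.
import lightning as L
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

trainer = L.Trainer(
    accelerator="gpu",
    devices=4,                     # single machine, 4 GPUs
    strategy="deepspeed_stage_2",  # ZeRO stage 2
    precision="bf16-mixed",
    max_epochs=1,
)
trainer.fit(model, train_dataloader)  # `model` / `train_dataloader` are placeholders

# The saved checkpoint is a directory of sharded DeepSpeed states; Lightning's
# wrapper around DeepSpeed's zero_to_fp32 consolidates it into a single fp32 file.
convert_zero_checkpoint_to_fp32_state_dict(
    "lightning_logs/version_0/checkpoints/last.ckpt",  # assumed checkpoint directory
    "consolidated_fp32.pt",
)
```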

[error information]:
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)

Traceback (most recent call last):
  File "/data/xxx/DstoryProgram/ds-algo-llm-ie/common_utils/zero_to_fp32.py", line 628, in <module>
    state_dict = get_fp32_state_dict_from_zero_checkpoint(save_path)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 523, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 209, in _get_fp32_state_dict_from_zero_checkpoint
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 150, in parse_optim_states
    state_dict = torch.load(f, map_location=device)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 1326, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 671, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive

Process finished with exit code 1

[checkpoint dir]:
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 2.1G Nov 5 18:17 mp_rank_00_model_states.pt

Loading the optimizer shards individually with torch.load('bf16_zero_pp_rank_x_mp_rank_00_optim_states.pt') works fine.
However, torch.load('mp_rank_00_model_states.pt') fails.
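
As a quick sanity check on which shard is actually corrupted, something along these lines (the directory path is an assumption) loads every .pt file in the checkpoint directory one by one:

```python
import glob
import os

import torch

ckpt_dir = "checkpoint/"  # assumed: path to the DeepSpeed checkpoint directory above

for path in sorted(glob.glob(os.path.join(ckpt_dir, "*.pt"))):
    try:
        torch.load(path, map_location="cpu")
        print(f"OK     {path} ({os.path.getsize(path) / 1e9:.2f} GB)")
    except RuntimeError as err:
        # A truncated or partially written file typically raises
        # "PytorchStreamReader failed reading zip archive: not a ZIP archive".
        print(f"FAILED {path}: {err}")
```

A shard that is much smaller than its peers (or zero bytes) usually points to a checkpoint write that was interrupted before it finished.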

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @awaelchli
