
PytorchStreamReader failed reading zip archive: not a ZIP archive #20398

Open
@Crazy-LittleBoy


Bug description

After fine-tuning a 1.3B BERT model with Lightning + DeepSpeed ZeRO stage 2 on a single machine with 4 GPUs, converting the checkpoint with zero_to_fp32.py fails with the error below.
The conversion works fine when the model is switched to a 0.1B BERT model.
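
For context, here is a minimal sketch of a setup like the one described above; the model, dataloader, Trainer arguments, and checkpoint path are assumptions for illustration, not taken from the report:

```python
# Hypothetical reproduction sketch: Lightning + DeepSpeed ZeRO stage 2 on 4 GPUs.
import lightning as L
from lightning.pytorch.utilities.deepspeed import convert_zero_checkpoint_to_fp32_state_dict

trainer = L.Trainer(
    accelerator="gpu",
    devices=4,                     # single machine, 4 GPUs
    strategy="deepspeed_stage_2",  # ZeRO stage 2
    precision="bf16-mixed",
    max_epochs=1,
)
trainer.fit(model, train_dataloader)  # `model` / `train_dataloader` are placeholders

# The saved checkpoint is a directory of sharded DeepSpeed states; Lightning's
# wrapper around DeepSpeed's zero_to_fp32 consolidates it into a single fp32 file.
convert_zero_checkpoint_to_fp32_state_dict(
    "lightning_logs/version_0/checkpoints/last.ckpt",  # assumed checkpoint directory
    "consolidated_fp32.pt",
)
```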

[error information]:
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_path)

Traceback (most recent call last):
  File "/data/xxx/DstoryProgram/ds-algo-llm-ie/common_utils/zero_to_fp32.py", line 628, in <module>
    state_dict = get_fp32_state_dict_from_zero_checkpoint(save_path)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 523, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 209, in _get_fp32_state_dict_from_zero_checkpoint
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/deepspeed/utils/zero_to_fp32.py", line 150, in parse_optim_states
    state_dict = torch.load(f, map_location=device)
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 1326, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/data/xxx/miniconda3/envs/llm/lib/python3.10/site-packages/torch/serialization.py", line 671, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: not a ZIP archive

Process finished with exit code 1

[checkpoint dir]:
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 3.6G Nov 5 18:18 bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 xxx xxx 2.1G Nov 5 18:17 mp_rank_00_model_states.pt

Loading the optimizer shards individually with torch.load('bf16_zero_pp_rank_x_mp_rank_00_optim_states.pt') works fine.
However, torch.load('mp_rank_00_model_states.pt') fails.
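
As a quick sanity check on which shard is actually corrupted, something along these lines (the directory path is an assumption) loads every .pt file in the checkpoint directory one by one:

```python
import glob
import os

import torch

ckpt_dir = "checkpoint/"  # assumed: path to the DeepSpeed checkpoint directory above

for path in sorted(glob.glob(os.path.join(ckpt_dir, "*.pt"))):
    try:
        torch.load(path, map_location="cpu")
        print(f"OK     {path} ({os.path.getsize(path) / 1e9:.2f} GB)")
    except RuntimeError as err:
        # A truncated or partially written file typically raises
        # "PytorchStreamReader failed reading zip archive: not a ZIP archive".
        print(f"FAILED {path}: {err}")
```

A shard that is much smaller than its peers (or zero bytes) usually points to a checkpoint write that was interrupted before it finished.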

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response

cc @awaelchli
