Hi, I am reproducing the experiments on the "osdi21-artifact" branch. However, multiple jobs have failed with the following error:
Traceback (most recent call last):
  File "run_glue.py", line 750, in <module>
    main()
  File "run_glue.py", line 476, in main
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer, lr_scheduler)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 68, in __init__
    adaptdl.checkpoint.load_state(self._state)
  File "/root/adaptdl/adaptdl/checkpoint.py", line 137, in load_state
    state.load(f)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 194, in load
    state_dicts, self.gain = torch.load(fileobj)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 600, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
This usually happens when a job is rescaled. I suspect it is caused by an environment conflict. Could you please provide the versions of PyTorch, CUDA, Python, and any other necessary modules?
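In case it helps with narrowing this down, here is a minimal sketch of two checks that could be run on an affected node. It assumes (this is not confirmed for this case) that the checkpoint being loaded is an incomplete or truncated zip archive, since "failed finding central directory" means the reader could not locate the end of the archive. The helper names `looks_like_complete_zip` and `report_versions` are hypothetical and are not part of adaptdl or PyTorch; only the standard-library `zipfile.is_zipfile` and the `torch.__version__` / `torch.version.cuda` attributes are used.

```python
import io
import platform
import zipfile


def looks_like_complete_zip(path_or_file):
    """Return True if the zip end-of-central-directory record can be found.

    Accepts a file path or a seekable binary file object. A False result
    suggests the file is not a zip archive or was only partially written.
    """
    return zipfile.is_zipfile(path_or_file)


def report_versions():
    """Collect the Python, PyTorch, and CUDA versions of this environment."""
    info = {"python": platform.python_version(), "torch": None, "cuda": None}
    try:
        import torch  # optional: left as None when PyTorch is not installed
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda
    except ImportError:
        pass
    return info


if __name__ == "__main__":
    # Build a small in-memory zip archive, then cut it short to imitate
    # a checkpoint whose write was interrupted (an assumed failure mode).
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("data.pkl", b"x" * 1000)
    raw = buf.getvalue()

    print("complete archive ok:", looks_like_complete_zip(io.BytesIO(raw)))
    print("truncated archive ok:", looks_like_complete_zip(io.BytesIO(raw[:500])))
    print("versions:", report_versions())
```

Passing the path of a failing checkpoint file to `looks_like_complete_zip` would show whether the file on disk is intact before `torch.load` is called on it. If it returns False right after a rescale, the problem is more likely an interrupted or concurrent checkpoint write than a version mismatch.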