Version of Pytorch and Cuda #136

@yuxiangwei0808

Hi, I am reproducing the experiments on the "osdi21-artifact" branch. However, multiple jobs have failed with the following error:

Traceback (most recent call last):
  File "run_glue.py", line 750, in <module>
    main()
  File "run_glue.py", line 476, in main
    model = adaptdl.torch.AdaptiveDataParallel(model, optimizer, lr_scheduler)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 68, in __init__
    adaptdl.checkpoint.load_state(self._state)
  File "/root/adaptdl/adaptdl/checkpoint.py", line 137, in load_state
    state.load(f)
  File "/root/adaptdl/adaptdl/torch/parallel.py", line 194, in load
    state_dicts, self.gain = torch.load(fileobj)
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 600, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/conda/lib/python3.8/site-packages/torch/serialization.py", line 242, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

This usually happens when a job is rescaled. I suspect it is caused by an environment mismatch. Could you please provide the versions of PyTorch, CUDA, Python, and any other required modules?
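
For anyone trying to narrow this down, here is a minimal diagnostic sketch in plain Python. It rests on an assumption that is not confirmed in this report: the "failed finding central directory" message means the zip reader could not find the archive index at the end of the file, which can happen when the checkpoint file is empty, incomplete, or not a zip archive at all. The helper names `describe_checkpoint` and `environment_versions` are hypothetical and are not part of AdaptDL or PyTorch; the only library calls used are `zipfile.is_zipfile`, `torch.__version__`, and `torch.version.cuda`.

```python
import sys
import zipfile


def describe_checkpoint(path):
    """Classify a checkpoint file without deserializing it (hypothetical helper)."""
    try:
        with open(path, "rb") as f:
            head = f.read(4)
    except OSError as exc:
        return f"unreadable: {exc}"
    if not head:
        return "empty"
    # is_zipfile looks for the end-of-central-directory record.
    if zipfile.is_zipfile(path):
        return "zip archive with central directory"
    if head[:2] == b"PK":
        return "zip header but no central directory (likely truncated)"
    return "not a zip archive (possibly the legacy pickle format)"


def environment_versions():
    """Collect the versions asked about above; torch is optional here."""
    versions = {"python": sys.version.split()[0]}
    try:
        import torch
        versions["torch"] = torch.__version__
        versions["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        versions["torch"] = None
    return versions
```

Running `describe_checkpoint` on the checkpoint file a failing job tried to load, and attaching the output of `environment_versions()`, should help tell an incomplete file apart from a version mismatch. A result of "likely truncated" would point at the file itself rather than at the installed PyTorch or CUDA version.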
