Skip to content

pytorch crashed during tutorial  #14

@davidhoover

Description

@davidhoover

I recently installed spisonet and attempted to run the tutorial. Pytorch crashed immediately during the training with these errors:

07-01 10:44:32, INFO     voxel_size 1.309999942779541
07-01 10:44:33, INFO     spIsoNet correction until resolution 3.5A!
                     Information beyond 3.5A remains unchanged
07-01 10:44:42, INFO     Start preparing subvolumes!
07-01 10:44:48, INFO     Done preparing subvolumes!
07-01 10:44:48, INFO     Start training!
07-01 10:44:52, INFO     Port number: 42933
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
  0%|                                                                                                                              | 0/250 [00:00<?, ?batch/s]/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/conv.py:605: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:84.)
  return F.conv3d(
  0%|                                                                                                                              | 0/250 [00:05<?, ?batch/s]
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/bin/spisonet.py", line 8, in <module>
    sys.exit(main())
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
    fire.Fire(ISONET)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
    map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta,  voxel_size=voxel_size, output_dir=output_dir,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
    network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
    mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 281, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 237, in start_processes
    while not context.join():
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 188, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 75, in _wrap
    fn(i, *args)
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 160, in ddp_train
    loss.backward()
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/usr/local/apps/spisonet/1.0/mamba/envs/spisonet/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

What version of torch is required? We have 2.3.1+cu118. This was run on a single P100 GPU.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions