nvmlDeviceGetHandleByIndex(5) failed: Unknown Error #20

@Hai-W-L

Hi everyone,

I have encountered a problem.

I can successfully run the following command:
#spisonet.py fsc3d emd_8731_half_map_1.mrc emd_8731_half_map_2.mrc emd_8731_msk_1.mrc --ncpus 16 --limit_res 3.5

But when I run the next command, it fails:
#spisonet.py reconstruct emd_8731_half_map_1.mrc emd_8731_half_map_2.mrc --aniso_file FSC3D.mrc --mask emd_8731_msk_1.mrc --limit_res 3.5 --epochs 30 --alpha 1 --beta 0.5 --output_dir isonet_maps --gpuID 0,1,2,3 --acc_batches 2

01-06 16:35:17, INFO The isonet_maps folder already exists, outputs will write into this folder
01-06 16:35:17, INFO voxel_size 1.309999942779541
01-06 16:35:17, WARNING The isonet_maps/emd_8731_half_map_1_data folder already exists. The old isonet_maps/emd_8731_half_map_1_data folder will be moved to isonet_maps/emd_8731_half_map_1_data~
01-06 16:35:17, WARNING The isonet_maps/emd_8731_half_map_2_data folder already exists. The old isonet_maps/emd_8731_half_map_2_data folder will be moved to isonet_maps/emd_8731_half_map_2_data~
01-06 16:35:18, INFO spIsoNet correction until resolution 3.5A!
Information beyond 3.5A remains unchanged
01-06 16:35:28, INFO Start preparing subvolumes!
01-06 16:35:37, INFO Done preparing subvolumes!
01-06 16:35:37, INFO Start training!
/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/cuda/__init__.py:716: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
01-06 16:35:40, INFO Port number: 52238
learning rate 0.0003
['isonet_maps/emd_8731_half_map_1_data', 'isonet_maps/emd_8731_half_map_2_data']
/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/cuda/__init__.py:716: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/cuda/__init__.py:716: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/cuda/__init__.py:716: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/cuda/__init__.py:716: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
[rank0]:[W106 16:35:49.058980636 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W0106 16:35:49.432000 94306 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 96574 via signal SIGTERM
W0106 16:35:49.433000 94306 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 97032 via signal SIGTERM
W0106 16:35:49.434000 94306 site-packages/torch/multiprocessing/spawn.py:160] Terminating process 97425 via signal SIGTERM
Traceback (most recent call last):
File "/opt/spIsoNet/spIsoNet/bin/spisonet.py", line 553, in <module>
exit(main())
File "/opt/spIsoNet/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 135, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 468, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/opt/spIsoNet/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 203, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
fn(i, *args)
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 52, in ddp_train
model = DDP(model, device_ids=[rank])
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 825, in __init__
_verify_param_shape_across_processes(self.process_group, parameters)
File "/opt/miniconda3/envs/spisonet/lib/python3.10/site-packages/torch/distributed/utils.py", line 288, in _verify_param_shape_across_processes
return dist._verify_params_across_processes(process_group, tensors, logger)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:317, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
nvmlDeviceGetHandleByIndex(5) failed: Unknown Error
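
The NCCL message above suggests rerunning with NCCL_DEBUG=INFO for details. A minimal sketch of how that rerun could be scripted is below. It assumes `nvidia-smi` may or may not be on the PATH, and the helper names (`nccl_debug_env`, `list_visible_gpus`, `rerun_with_nccl_debug`) are hypothetical and not part of spIsoNet. It only collects more diagnostic output and does not fix the error.

```python
import os
import subprocess


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL_DEBUG=INFO set."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"
    return env


def list_visible_gpus():
    """Return the output of `nvidia-smi -L`, or None if it cannot be run."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def rerun_with_nccl_debug(command):
    """Run `command` (a list of arguments) with NCCL_DEBUG=INFO; return its exit code."""
    return subprocess.run(command, env=nccl_debug_env()).returncode
```

Passing the failing `spisonet.py reconstruct ...` command, split into a list of arguments, to `rerun_with_nccl_debug` should print the extra NCCL log lines to the terminal. Comparing the `list_visible_gpus()` output with the `--gpuID 0,1,2,3` argument may help show which device index 5 refers to.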
