
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2498674) of binary: /share3/home/lufuzhen/anaconda3/envs/rvrt/bin/python #31

@Marianne-Lu

Description


I'm working on the video denoising task on the DAVIS dataset. The command I used to launch training is as follows:
python -m torch.distributed.run --nproc_per_node=3 --master_port=1234 main_train_vrt.py --opt options/rvrt/006_train_rvrt_videodenoising_davis.json --dist True
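With --nproc_per_node=3 the launcher starts three worker processes, each of which needs enough free memory on its own GPU. Below is a minimal sketch for checking how much memory each visible device reports as free before launching (plain PyTorch; the check_gpu_mem.py name is just illustrative and is not part of KAIR):

```python
# check_gpu_mem.py -- illustrative helper, not part of KAIR.
# Prints free/total memory for every visible GPU so per-rank headroom
# can be verified before running torch.distributed.run.
import torch

if __name__ == "__main__":
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # values in bytes
        print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```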

Here are the details of the error message:
Traceback (most recent call last):
File "main_train_vrt.py", line 319, in
main()
File "main_train_vrt.py", line 192, in main
model.optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_vrt.py", line 77, in optimize_parameters
super(ModelVRT, self).optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 165, in optimize_parameters
self.netG_forward()
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 158, in netG_forward
self.E = self.netG(self.L)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1169, in forward
feats = self.propagate(feats, flows, module_name, updated_flows)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1048, in propagate
feat_prop = self.deform_align[module_name](feat_q, feat_k, feat_prop,
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 250, in forward
v = deform_attn(q, kv, offset, self.kernel_h, self.kernel_w, self.stride, self.padding, self.dilation,
File "/share3/home/lufuzhen/code/kair/models/op/deform_attn.py", line 80, in forward
deform_attn_ext.deform_attn_forward(q, kv, offset, output,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 2; 23.64 GiB total capacity; 4.36 GiB already allocated; 166.50 MiB free; 4.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "main_train_vrt.py", line 319, in
main()
File "main_train_vrt.py", line 192, in main
model.optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_vrt.py", line 77, in optimize_parameters
super(ModelVRT, self).optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 165, in optimize_parameters
self.netG_forward()
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 158, in netG_forward
self.E = self.netG(self.L)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1169, in forward
feats = self.propagate(feats, flows, module_name, updated_flows)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1061, in propagate
feat_prop = feat_prop + self.backbone[module_name](torch.cat(feat, dim=2))
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 706, in forward
return self.main(x)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 655, in forward
return x + self.linear(self.residual_group(x).transpose(1, 4)).transpose(1, 4)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 23.64 GiB total capacity; 4.58 GiB already allocated; 18.50 MiB free; 4.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[W reducer.cpp:325] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [64, 192, 1, 1, 1], strides() = [192, 1, 192, 192, 192]
bucket_view.sizes() = [64, 192, 1, 1, 1], strides() = [192, 1, 1, 1, 1] (function operator())
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2498673 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2498674) of binary: /share3/home/lufuzhen/anaconda3/envs/rvrt/bin/python
Traceback (most recent call last):
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in
main()
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_train_vrt.py FAILED

Failures:
[1]:
time : 2025-04-18_13:27:39
host : server-03
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2498675)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2025-04-18_13:27:39
host : server-03
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2498674)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error 1: CUDA out of memory
I have successfully trained on REDS, but when running the DAVIS denoising config I get a 'CUDA out of memory' error.
Here are the steps I've taken so far to address it:
- Set export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 (see the sketch after this list)
- Reduced dataloader_batch_size from 8 to 6, then to 3
- Reduced dataloader_num_workers from 32 to 6, then to 2
- Reduced num_frame from 16 to 4
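As a complement to the export above, here is a minimal sketch of setting the allocator option from inside the script and dumping an allocator report when the OOM is raised, to see how much of the shortfall is fragmentation versus actual usage (the forward_with_report helper is illustrative only; as far as I know KAIR's training loop does not include anything like it):

```python
# Illustrative only: set the allocator option before CUDA is initialized,
# and dump an allocator report if the forward pass runs out of memory.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

def forward_with_report(model, batch):
    try:
        return model(batch)
    except torch.cuda.OutOfMemoryError:
        # Shows allocated vs. reserved memory and fragmentation statistics.
        print(torch.cuda.memory_summary(abbreviated=True))
        raise
```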
Error 2: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2498674) of binary: /share3/home/lufuzhen/anaconda3/envs/rvrt/bin/python
Online research suggests that this PyTorch error is due to the dataset being too large.
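For error 2, the "To enable traceback" hint in the launcher report above points to the elastic error-handling docs; as far as I can tell, that means wrapping the training entry point with the record decorator so each worker writes its traceback to an error file. A minimal sketch, assuming main_train_vrt.py does not already do this:

```python
# Sketch following https://pytorch.org/docs/stable/elastic/errors.html:
# the @record decorator makes each worker write its exception to the error
# file that torch.distributed.run sets up, so the failure report shows a
# real traceback instead of "error_file: <N/A>".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training logic in main_train_vrt.py

if __name__ == "__main__":
    main()
```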

Has anyone else encountered similar problems?
