
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2498674) of binary: /share3/home/lufuzhen/anaconda3/envs/rvrt/bin/python #31

@Marianne-Lu

Description


I'm working on the video denoising task on the DAVIS dataset. The command I used to launch training is as follows:
python -m torch.distributed.run --nproc_per_node=3 --master_port=1234 main_train_vrt.py --opt options/rvrt/006_train_rvrt_videodenoising_davis.json --dist True
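With --nproc_per_node=3 the launcher starts three worker processes, each of which needs enough free memory on its own GPU. Below is a minimal sketch for checking how much memory each visible device reports as free before launching (plain PyTorch; the check_gpu_mem.py name is just illustrative and is not part of KAIR):

```python
# check_gpu_mem.py -- illustrative helper, not part of KAIR.
# Prints free/total memory for every visible GPU so per-rank headroom
# can be verified before running torch.distributed.run.
import torch

if __name__ == "__main__":
    for i in range(torch.cuda.device_count()):
        free, total = torch.cuda.mem_get_info(i)  # values in bytes
        print(f"GPU {i}: {free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```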

Here are the details of the error message:
Traceback (most recent call last):
File "main_train_vrt.py", line 319, in
main()
File "main_train_vrt.py", line 192, in main
model.optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_vrt.py", line 77, in optimize_parameters
super(ModelVRT, self).optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 165, in optimize_parameters
self.netG_forward()
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 158, in netG_forward
self.E = self.netG(self.L)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1169, in forward
feats = self.propagate(feats, flows, module_name, updated_flows)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1048, in propagate
feat_prop = self.deform_align[module_name](feat_q, feat_k, feat_prop,
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 250, in forward
v = deform_attn(q, kv, offset, self.kernel_h, self.kernel_w, self.stride, self.padding, self.dilation,
File "/share3/home/lufuzhen/code/kair/models/op/deform_attn.py", line 80, in forward
deform_attn_ext.deform_attn_forward(q, kv, offset, output,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 216.00 MiB (GPU 2; 23.64 GiB total capacity; 4.36 GiB already allocated; 166.50 MiB free; 4.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
File "main_train_vrt.py", line 319, in
main()
File "main_train_vrt.py", line 192, in main
model.optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_vrt.py", line 77, in optimize_parameters
super(ModelVRT, self).optimize_parameters(current_step)
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 165, in optimize_parameters
self.netG_forward()
File "/share3/home/lufuzhen/code/kair/models/model_plain.py", line 158, in netG_forward
self.E = self.netG(self.L)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1169, in forward
feats = self.propagate(feats, flows, module_name, updated_flows)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 1061, in propagate
feat_prop = feat_prop + self.backbone[module_name](torch.cat(feat, dim=2))
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 706, in forward
return self.main(x)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/code/kair/models/network_rvrt.py", line 655, in forward
return x + self.linear(self.residual_group(x).transpose(1, 4)).transpose(1, 4)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
return F.linear(input, self.weight, self.bias)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 1; 23.64 GiB total capacity; 4.58 GiB already allocated; 18.50 MiB free; 4.68 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[W reducer.cpp:325] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [64, 192, 1, 1, 1], strides() = [192, 1, 192, 192, 192]
bucket_view.sizes() = [64, 192, 1, 1, 1], strides() = [192, 1, 1, 1, 1] (function operator())
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 2498673 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2498674) of binary: /share3/home/lufuzhen/anaconda3/envs/rvrt/bin/python
Traceback (most recent call last):
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/run.py", line 766, in
main()
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/share3/home/lufuzhen/anaconda3/envs/rvrt/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main_train_vrt.py FAILED

Failures:
[1]:
time : 2025-04-18_13:27:39
host : server-03
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 2498675)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2025-04-18_13:27:39
host : server-03
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2498674)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Error 1: CUDA out of memory
I have successfully trained on REDS, but when running the DAVIS denoising config I get a 'CUDA out of memory' error.
Here are the steps I've taken so far to address it:
- Set export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 (see the sketch after this list)
- Reduced dataloader_batch_size from 8 to 6, then to 3
- Reduced dataloader_num_workers from 32 to 6, then to 2
- Reduced num_frame from 16 to 4
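As a complement to the export above, here is a minimal sketch of setting the allocator option from inside the script and dumping an allocator report when the OOM is raised, to see how much of the shortfall is fragmentation versus actual usage (the forward_with_report helper is illustrative only; as far as I know KAIR's training loop does not include anything like it):

```python
# Illustrative only: set the allocator option before CUDA is initialized,
# and dump an allocator report if the forward pass runs out of memory.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

def forward_with_report(model, batch):
    try:
        return model(batch)
    except torch.cuda.OutOfMemoryError:
        # Shows allocated vs. reserved memory and fragmentation statistics.
        print(torch.cuda.memory_summary(abbreviated=True))
        raise
```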
Error 2: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 2498674) of binary: /share3/home/lufuzhen/anaconda3/envs/rvrt/bin/python
Online research suggests that this PyTorch error is due to the dataset being too large.
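For error 2, the "To enable traceback" hint in the launcher report above points to the elastic error-handling docs; as far as I can tell, that means wrapping the training entry point with the record decorator so each worker writes its traceback to an error file. A minimal sketch, assuming main_train_vrt.py does not already do this:

```python
# Sketch following https://pytorch.org/docs/stable/elastic/errors.html:
# the @record decorator makes each worker write its exception to the error
# file that torch.distributed.run sets up, so the failure report shows a
# real traceback instead of "error_file: <N/A>".
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # existing training logic in main_train_vrt.py

if __name__ == "__main__":
    main()
```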

Has anyone else encountered similar problems?
