
How to train this model in one-node multi-GPU mode? #14

@trillionpowers

Thanks for your project.

My environment is Ubuntu 16.04 + Python 3.6 + PyTorch 1.1 + CUDA 10.0.

I tried to launch distributed training with this command:
python -m torch.distributed.launch --nproc_per_node=2 --master_port=4321 train_niqe.py -opt options/train/train_AdaGrowingNet.yml --launcher pytorch

First, for VGGFeatureExtractor, I got this error:
RuntimeError: replicas_[0].size() >= 1 ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53, please report a bug to PyTorch. Expected at least one parameter. (Reducer at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f27c47be441 in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f27c47bdd7a in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::Reducer(std::vector<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >, std::allocator<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > > >, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >, std::shared_ptr<c10d::ProcessGroup>) + 0x199c (0x7f280405fc1c in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Then I moved the line that freezes netF's parameters (v.requires_grad = False) so that it runs after self.netF = DistributedDataParallel(self.netF, device_ids=[torch.cuda.current_device()]), whereas originally the parameters are frozen when the VGGFeatureExtractor is defined.
With that change, the error disappeared.
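
For reference, a minimal sketch of the ordering I mean (the VGGFeatureExtractor constructor arguments are only illustrative, and the process group is assumed to be initialised already by the launcher):

```python
import torch
from torch.nn.parallel import DistributedDataParallel

# VGGFeatureExtractor is the class from this repo; the arguments below are
# illustrative, not the exact ones used in train_niqe.py.
netF = VGGFeatureExtractor(feature_layer=34, use_bn=False).cuda()

# Wrap in DDP first. In PyTorch 1.1 the Reducer asserts that at least one
# parameter requires grad, so freezing everything *before* this line triggers
# "Expected at least one parameter".
netF = DistributedDataParallel(netF, device_ids=[torch.cuda.current_device()])

# Freeze the feature extractor only after the DDP wrapper has been built.
for v in netF.parameters():
    v.requires_grad = False
```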

Then I ran the training again, but it hit another RuntimeError:
Traceback (most recent call last):
  File "train_niqe.py", line 260, in <module>
    main()
  File "train_niqe.py", line 172, in main
    model.optimize_parameters(current_step)
  File "/home/wangzhan/SRtask/data_augment/RankSRGAN-master/codes/models/RankSRGAN_model.py", line 215, in optimize_parameters
    l_d_total.backward()
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
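
For what it's worth, a minimal sketch of enabling the anomaly detection that the error message suggests, assuming it is placed near the top of train_niqe.py (torch.autograd.set_detect_anomaly and the detect_anomaly context manager are standard PyTorch APIs):

```python
import torch

# Globally enable anomaly detection: the backward pass will then point to the
# forward op that produced the tensor later modified in place.
torch.autograd.set_detect_anomaly(True)

# Alternatively, wrap only the failing step to keep the rest of training fast:
# with torch.autograd.detect_anomaly():
#     l_d_total.backward()
```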

Did you encounter this problem?
How can it be fixed?
