
How to train this model in one-node multi-GPU mode? #14

@trillionpowers

Thanks for your project.

My environment is Ubuntu 16.04 + Python 3.6 + PyTorch 1.1 + CUDA 10.0.

I tried to launch distributed training with this command:
python -m torch.distributed.launch --nproc_per_node=2 --master_port=4321 train_niqe.py -opt options/train/train_AdaGrowingNet.yml --launcher pytorch

First, for VGGFeatureExtractor, I got this error:
RuntimeError: replicas_[0].size() >= 1 ASSERT FAILED at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53, please report a bug to PyTorch. Expected at least one parameter. (Reducer at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:53)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f27c47be441 in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f27c47bdd7a in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::Reducer(std::vector<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> >, std::allocator<std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > > >, std::vector<std::vector<unsigned long, std::allocator<unsigned long> >, std::allocator<std::vector<unsigned long, std::allocator<unsigned long> > > >, std::shared_ptr<c10d::ProcessGroup>) + 0x199c (0x7f280405fc1c in /home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Then I moved the line that freezes netF's parameters (v.requires_grad = False) so that it runs after self.netF = DistributedDataParallel(self.netF, device_ids=[torch.cuda.current_device()]), whereas originally the parameters are frozen when the VGGFeatureExtractor is defined.
With that change, the error disappeared.
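
For reference, a minimal sketch of the ordering I mean (the VGGFeatureExtractor constructor arguments are only illustrative, and the process group is assumed to be initialised already by the launcher):

```python
import torch
from torch.nn.parallel import DistributedDataParallel

# VGGFeatureExtractor is the class from this repo; the arguments below are
# illustrative, not the exact ones used in train_niqe.py.
netF = VGGFeatureExtractor(feature_layer=34, use_bn=False).cuda()

# Wrap in DDP first. In PyTorch 1.1 the Reducer asserts that at least one
# parameter requires grad, so freezing everything *before* this line triggers
# "Expected at least one parameter".
netF = DistributedDataParallel(netF, device_ids=[torch.cuda.current_device()])

# Freeze the feature extractor only after the DDP wrapper has been built.
for v in netF.parameters():
    v.requires_grad = False
```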

Then I ran the training again, but it hit another RuntimeError:
Traceback (most recent call last):
  File "train_niqe.py", line 260, in <module>
    main()
  File "train_niqe.py", line 172, in main
    model.optimize_parameters(current_step)
  File "/home/wangzhan/SRtask/data_augment/RankSRGAN-master/codes/models/RankSRGAN_model.py", line 215, in optimize_parameters
    l_d_total.backward()
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/wangzhan/anaconda3/envs/py36_pt10_tf14/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
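
For what it's worth, a minimal sketch of enabling the anomaly detection that the error message suggests, assuming it is placed near the top of train_niqe.py (torch.autograd.set_detect_anomaly and the detect_anomaly context manager are standard PyTorch APIs):

```python
import torch

# Globally enable anomaly detection: the backward pass will then point to the
# forward op that produced the tensor later modified in place.
torch.autograd.set_detect_anomaly(True)

# Alternatively, wrap only the failing step to keep the rest of training fast:
# with torch.autograd.detect_anomaly():
#     l_d_total.backward()
```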

Did you encounter this problem?
How can it be fixed?
