Training VPoser fails: torch.distributions.normal.Normal gets NaN #70
I am trying to retrain VPoser on the AMASS dataset, which I downloaded from the official website. I followed the instructions in the README but still got this weird error. After training for about 200 epochs, line 56 of src/human_body_prior/models/vposer_model.py, which constructs torch.distributions.normal.Normal, starts receiving NaN values. It seems to be caused by a data issue.
I would appreciate it if anyone could figure out why this happens and how to fix it, or give me any insight.
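One way to test the data hypothesis is to scan the pose tensors for NaN or Inf values before they reach the model. The following is only a minimal sketch: `find_nonfinite_rows` is a hypothetical helper, not part of human_body_prior, and the tensor below is synthetic stand-in data shaped like `batch['pose_body'].view(-1, 63)`.

```python
import torch


def find_nonfinite_rows(pose_body: torch.Tensor) -> torch.Tensor:
    """Return the indices of rows that contain at least one NaN or Inf."""
    flat = pose_body.reshape(pose_body.shape[0], -1)
    bad = ~torch.isfinite(flat).all(dim=1)
    return torch.nonzero(bad).flatten()


# Synthetic stand-in for one batch of body poses (assumption: 63 values per pose).
batch = torch.zeros(4, 63)
batch[2, 5] = float("nan")

bad_rows = find_nonfinite_rows(batch)
print(bad_rows.tolist())
```

If a scan like this over the whole training set returns no rows, the NaN is probably not coming from the data itself.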
#training_jobs to be done: 1
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1580: UserWarning: GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`.
rank_zero_warn(
[V08_16] -- Total Trainable Parameters Count in vp_model is 0.94 M.
| Name | Type | Params
---------------------------------------
0 | vp_model | VPoser | 936 K
1 | bm_train | BodyModel | 0
---------------------------------------
936 K Trainable params
0 Non-trainable params
936 K Total params
3.745 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.02 matrot:4.36 jtr:0.54 loss_total:5.95
Validation sanity check: 50%|█████████████████████████████████████████████████████████████ | 1/2 [00:02<00:02, 2.05s/it]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.34 jtr:0.53 loss_total:5.89
Validation sanity check: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.07it/s][V08_16] -- Epoch 0: val_loss:0.51
[V08_16] -- lr is [0.001]
Training: 0it [00:00, ?it/s][V08_16] -- Created a git archive backup at /data/hualin/vposer_train_gen/V08_16/code/vposer_2023_08_17_13_44_54.tar.gz
Epoch 0: 0%| | 0/7637 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.30 jtr:0.53 loss_total:5.86
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/closure.py:35: LightningDeprecationWarning: One of the returned values {'log', 'progress_bar'} has a `grad_fn`. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: `return {'loss': ..., 'something': something.detach()}`
rank_zero_deprecation(
Epoch 0: 0%|
...
Epoch 0: 1%|█▌ | 107/7637 [00:25<30:11, 4.16it/s, loss=0.665, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.36 jtr:0.11 loss_total:0.69
Epoch 0: 1%|█▌ | 108/7637 [00:25<30:09, 4.16it/s, loss=0.663, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.37 jtr:0.11 loss_total:0.69
Epoch 0: 1%|█▌ | 109/7637 [00:26<30:08, 4.16it/s, loss=0.666, v_num=30]Traceback (most recent call last):
File "V02_05.py", line 55, in <module>
main()
File "V02_05.py", line 51, in main
train_vposer_once(job)
File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 361, in train_vposer_once
trainer.fit(model)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
self._call_and_handle_interrupt(
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
self._dispatch()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
return self._run_train()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
self.fit_loop.run()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 215, in advance
result = self._run_optimization(
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
lightning_module.optimizer_step(
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 164, in step
trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in optimizer_step
self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
optimizer.step(closure=closure, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/adam.py", line 183, in step
loss = closure()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 148, in _wrap_closure
closure_result = closure()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 160, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 142, in closure
step_output = self._step_fn()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 435, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 216, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 213, in training_step
return self.model.training_step(*args, **kwargs)
File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 232, in training_step
drec = self(batch['pose_body'].view(-1, 63))
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 107, in forward
return self.vp_model(pose_body)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 121, in forward
q_z = self.encode(pose_body)
File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 100, in encode
return self.encoder_net(pose_body)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin///vposer_66/src/human_body_prior/models/vposer_model.py", line 56, in forward
return torch.distributions.normal.Normal(self.mu(Xout), F.softplus(self.logvar(Xout)))
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/normal.py", line 56, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/distribution.py", line 56, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (128, 32)) of distribution Normal(loc: torch.Size([128, 32]), scale: torch.Size([128, 32])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], grad_fn=<AddmmBackward0>)
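The `grad_fn=<AddmmBackward0>` in the message indicates that the NaN `loc` tensor is the output of a linear layer, here `self.mu(Xout)`. The most likely causes are a NaN already present in the layer's input or weights that became NaN after a diverging update. The sketch below shows one way to tell these apart; `report_nonfinite` is a hypothetical helper and the `nn.Linear` is a stand-in, not the actual VPoser encoder.

```python
import torch
from torch import nn


def report_nonfinite(module: nn.Module, inputs: torch.Tensor) -> dict:
    """Report whether the inputs are finite and which parameters are not."""
    bad_params = [
        name
        for name, param in module.named_parameters()
        if not torch.isfinite(param).all()
    ]
    return {
        "input_ok": bool(torch.isfinite(inputs).all()),
        "bad_params": bad_params,
    }


# Stand-in layer with one weight deliberately corrupted for illustration.
layer = nn.Linear(4, 2)
with torch.no_grad():
    layer.weight[0, 0] = float("nan")

report = report_nonfinite(layer, torch.zeros(3, 4))
print(report)
```

Finite inputs combined with non-finite parameters would point to the optimization diverging (for example, exploding gradients) rather than to the dataset.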