
Training VPoser: torch.distributions.normal.Normal gets NaN values #70

Open
@lithiumice

Description

I am trying to retrain VPoser on the AMASS dataset, which I downloaded from the official website. I followed the instructions in the README but still hit this error: after training for about 200 epochs, the call to torch.distributions.normal.Normal at line 56 of src/human_body_prior/models/vposer_model.py starts receiving NaN values. It seems to be caused by a data issue.

I would appreciate it if anyone could figure out why this happens, or give me any insight.
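
Since I suspect the data, here is a minimal sketch of the check I plan to run over the downloaded .npz files (the directory path is a placeholder; I am assuming the body pose is stored under the 'poses' key, as in the AMASS downloads):

```python
# Minimal sketch (path is a placeholder): scan the downloaded AMASS .npz
# files for non-finite pose values before training.
import glob
import numpy as np

amass_dir = '/path/to/amass_npz'  # placeholder, adjust to the download location

for fname in glob.glob(f'{amass_dir}/**/*.npz', recursive=True):
    data = np.load(fname)
    if 'poses' not in data:
        continue
    if not np.isfinite(data['poses']).all():
        print(f'non-finite pose values in {fname}')
```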

#training_jobs to be done: 1
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1580: UserWarning: GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`.
  rank_zero_warn(
[V08_16] -- Total Trainable Parameters Count in vp_model is 0.94 M.

  | Name     | Type      | Params
---------------------------------------
0 | vp_model | VPoser    | 936 K 
1 | bm_train | BodyModel | 0     
---------------------------------------
936 K     Trainable params
0         Non-trainable params
936 K     Total params
3.745     Total estimated model params size (MB)
Validation sanity check:   0%|                                                                                                                                  | 0/2 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.02 matrot:4.36 jtr:0.54 loss_total:5.95
Validation sanity check:  50%|█████████████████████████████████████████████████████████████                                                             | 1/2 [00:02<00:02,  2.05s/it]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.34 jtr:0.53 loss_total:5.89
Validation sanity check: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.07it/s][V08_16] -- Epoch 0: val_loss:0.51
[V08_16] -- lr is [0.001]
Training: 0it [00:00, ?it/s][V08_16] -- Created a git archive backup at /data/hualin/vposer_train_gen/V08_16/code/vposer_2023_08_17_13_44_54.tar.gz                                   
Epoch 0:   0%|                                                                                                                                               | 0/7637 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.30 jtr:0.53 loss_total:5.86
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/closure.py:35: LightningDeprecationWarning: One of the returned values {'log', 'progress_bar'} has a `grad_fn`. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: `return {'loss': ..., 'something': something.detach()}`
  rank_zero_deprecation(
Epoch 0:   0%|        
...
Epoch 0:   1%|█▌                                                                                                             | 107/7637 [00:25<30:11,  4.16it/s, loss=0.665, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.36 jtr:0.11 loss_total:0.69
Epoch 0:   1%|█▌                                                                                                             | 108/7637 [00:25<30:09,  4.16it/s, loss=0.663, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.37 jtr:0.11 loss_total:0.69
Epoch 0:   1%|█▌                                                                                                             | 109/7637 [00:26<30:08,  4.16it/s, loss=0.666, v_num=30]Traceback (most recent call last):
  File "V02_05.py", line 55, in <module>
    main()
  File "V02_05.py", line 51, in main
    train_vposer_once(job)
  File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 361, in train_vposer_once
    trainer.fit(model)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
    self._call_and_handle_interrupt(
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
    self._dispatch()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
    self.training_type_plugin.start_training(self)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
    self._results = trainer.run_stage()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
    return self._run_train()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
    self.fit_loop.run()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
    self.epoch_loop.run(data_fetcher)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
    batch_output = self.batch_loop.run(batch, batch_idx)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
    outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
    self.advance(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 215, in advance
    result = self._run_optimization(
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
    lightning_module.optimizer_step(
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 164, in step
    trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in optimizer_step
    self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
    optimizer.step(closure=closure, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/adam.py", line 183, in step
    loss = closure()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 148, in _wrap_closure
    closure_result = closure()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 160, in __call__
    self._result = self.closure(*args, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 142, in closure
    step_output = self._step_fn()
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 435, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 216, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 213, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 232, in training_step
    drec = self(batch['pose_body'].view(-1, 63))
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 107, in forward
    return self.vp_model(pose_body)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 121, in forward
    q_z = self.encode(pose_body)
  File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 100, in encode
    return self.encoder_net(pose_body)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
    input = module(input)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/hualin///vposer_66/src/human_body_prior/models/vposer_model.py", line 56, in forward
    return torch.distributions.normal.Normal(self.mu(Xout), F.softplus(self.logvar(Xout)))
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/normal.py", line 56, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/distribution.py", line 56, in __init__
    raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (128, 32)) of distribution Normal(loc: torch.Size([128, 32]), scale: torch.Size([128, 32])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], grad_fn=<AddmmBackward0>)
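
In case it helps, this is a minimal sketch of the two checks I am considering: validating each batch before the forward pass (the 'pose_body' key and the view(-1, 63) reshape are taken from training_step in the traceback above), and enabling autograd anomaly detection to pinpoint the op that first produces a NaN:

```python
# Sketch only: batch sanity check plus anomaly detection to localize the NaN.
import torch

torch.autograd.set_detect_anomaly(True)  # slower, but reports the op that first yields NaN gradients

def check_batch(batch):
    # 'pose_body' and the (-1, 63) reshape mirror training_step in vposer_trainer.py
    pose = batch['pose_body'].view(-1, 63)
    if not torch.isfinite(pose).all():
        raise RuntimeError('non-finite values in pose_body batch')
    return pose

# intended usage inside training_step, right before self(batch['pose_body'].view(-1, 63)):
# check_batch(batch)
```

If the data turns out to be clean, I might also try gradient clipping (e.g. Trainer(gradient_clip_val=1.0)) or a lower learning rate, in case the loss is simply diverging.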
