FITS/FILM/GP-VAE fail when running on multiple CUDA devices #632

@WenjieDu

Description

1. System Info

v0.11

2. Information

  • The official example scripts
  • My own created scripts

3. Reproduction

  • pypots.clustering.crli
  • pypots.imputation.usgan
  • pypots.imputation.koopa
  • pypots.imputation.film
  • pypots.imputation.gpvae
  • pypots.imputation.fits
  • pypots.forecasting.fits

4. Expected behavior

For pypots.forecasting.fits and pypots.imputation.fits we have

E       RuntimeError: Caught RuntimeError in replica 0 on device 1.
E       Original Traceback (most recent call last):
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
E           output = module(*input, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/forecasting/fits/core.py", line 68, in forward
E           enc_out = self.backbone(enc_out)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/fits/backbone.py", line 63, in forward
E           low_specxy_ = self.freq_upsampler(low_specx.permute(0, 2, 1))
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
E           return F.linear(input, self.weight, self.bias)
E       RuntimeError: t() expects a tensor with <= 2 dimensions, but self is 3D

For pypots.imputation.film we have

E       RuntimeError: Caught RuntimeError in replica 0 on device 1.
E       Original Traceback (most recent call last):
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
E           output = module(*input, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/imputation/film/core.py", line 65, in forward
E           backbone_output = self.backbone(X_embedding)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/film/backbone.py", line 65, in forward
E           out1 = self.spec_conv_1[i](x_in_c)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/film/layers.py", line 128, in forward
E           out_ft[:, :, :, : self.modes2] = torch.einsum("bjix,iox->bjox", a, self.weights1)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/functional.py", line 380, in einsum
E           return _VF.einsum(equation, operands)  # type: ignore[attr-defined]
E       RuntimeError: einsum(): the number of subscripts in the equation (3) does not match the number of dimensions (4) for operand 1 and no ellipsis was given
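Note that the FITS and FILM tracebacks above both report a parameter with one more dimension than the code expects (a 3D weight in `F.linear`, a 4D operand for the three-subscript `iox` in `einsum`). One possible explanation, which is an assumption and not something confirmed in this issue, is that these parameters are complex-valued and reach the replicas in their real-valued view, which carries an extra trailing dimension of size 2. The sketch below only shows what that real view looks like for a hypothetical complex weight; it does not reproduce the multi-GPU failure itself.

```python
import torch

# Hypothetical complex weight shaped like a linear layer's
# (out_features, in_features) matrix. The shape is illustrative only.
complex_weight = torch.zeros(6, 4, dtype=torch.cfloat)

# The real-valued view stores (real, imag) pairs in a new last dimension.
real_view = torch.view_as_real(complex_weight)

print(complex_weight.dim(), tuple(complex_weight.shape))
print(real_view.dim(), tuple(real_view.shape))

# The view can be turned back into the original complex tensor.
round_trip = torch.view_as_complex(real_view)
print(round_trip.dtype, tuple(round_trip.shape))
```

If this is what happens inside the replicas, a 2D complex weight would show up as 3D and a 3D one as 4D, which matches the dimension counts in the two error messages.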

For pypots.imputation.gpvae we have

E       RuntimeError: Caught RuntimeError in replica 1 on device 2.
E       Original Traceback (most recent call last):
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in _worker
E           output = module(*input, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/imputation/gpvae/core.py", line 97, in forward
E           elbo_loss = self.backbone(X, missing_mask)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
E           return self._call_impl(*args, **kwargs)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
E           return forward_call(*args, **kwargs)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/gpvae/backbone.py", line 157, in forward
E           self.prior = self._init_prior(device=X.device)
E         File "/home/wdudu/PyPOTS_dev/pypots/nn/modules/gpvae/backbone.py", line 137, in _init_prior
E           prior = torch.distributions.MultivariateNormal(
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py", line 177, in __init__
E           super().__init__(batch_shape, event_shape, validate_args=validate_args)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/distributions/distribution.py", line 66, in __init__
E           valid = constraint.check(value)
E         File "/home/wdudu/.conda/envs/ml/lib/python3.10/site-packages/torch/distributions/constraints.py", line 557, in check
E           return torch.linalg.cholesky_ex(value).info.eq(0)
E       RuntimeError: lazy wrapper should be called at most once

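For the other models listed above (pypots.clustering.crli, pypots.imputation.usgan, pypots.imputation.koopa) the error is

'DataParallel' object has no attribute 'backbone'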
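The 'DataParallel' attribute error can be shown in isolation. `torch.nn.DataParallel` keeps the wrapped model under its `.module` attribute, so attributes of the original model are not reachable on the wrapper directly. The sketch below uses made-up `TinyBackbone` and `TinyModel` classes and a hypothetical `unwrap` helper; none of these are PyPOTS code, and this is only one possible way to handle the wrapper.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """Stand-in for a model backbone (hypothetical, not PyPOTS code)."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)

    def forward(self, x):
        return self.proj(x)


class TinyModel(nn.Module):
    """Stand-in for a model that exposes a `backbone` attribute."""

    def __init__(self):
        super().__init__()
        self.backbone = TinyBackbone()

    def forward(self, x):
        return self.backbone(x)


model = TinyModel()
wrapped = nn.DataParallel(model)

# Accessing `backbone` on the wrapper fails, because the wrapper
# only registers the original model as its `module` submodule.
try:
    wrapped.backbone
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
print(error_message)


def unwrap(m: nn.Module) -> nn.Module:
    """Return the underlying model whether or not it is wrapped."""
    return m.module if isinstance(m, nn.DataParallel) else m


# Going through `.module` reaches the original attribute.
backbone = unwrap(wrapped).backbone
print(type(backbone).__name__)
```

A helper like `unwrap` keeps the calling code identical for single-device and multi-device runs, since it returns the model unchanged when no wrapper is present.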

Metadata

Labels

bug (Something isn't working), help wanted (Extra attention is needed), keep (Keep this issue away from being stale.)
