CUDA/CuDNN related errors occur in Titan-RTX environments #39

@dogyoonlee

Hello.

I have tried several different environments,
but I still cannot get your code to run.

My GPU is a Titan RTX,
and my attempts are listed below.

I also tried running the code with CUDA 8.0 earlier, but the errors were
almost the same as with CUDA 9.0.


  1. ---environment---
    ubuntu 18.04
    CUDA 9.0
    CuDNN 7.1
    torch 0.3.1 / 0.4.0
    ==>
    error message :
    Found GPU0 TITAN RTX which requires CUDA_VERSION >= 9000 for
    optimal performance and fast startup time, but your PyTorch was compiled
    with CUDA_VERSION 8000. Please install the correct PyTorch binary
    using instructions from http://pytorch.org

warnings.warn(incorrect_binary_warn % (d, name, 9000, CUDA_VERSION))

The process is then "Killed" when the data are loaded onto the GPU, specifically at the conv2d() call on
line 55 of pointnet2_modules.py (self.mlp[i] in _PointnetSAModuleBase).
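
For anyone reproducing this, here is a minimal sketch of checking whether a PyTorch build can target the GPU it finds. It relies on one fact outside this report: the Titan RTX is a Turing card with compute capability 7.5, which CUDA toolkits older than 10.0 cannot target. The table and the helper names (`MIN_CUDA_FOR_CAPABILITY`, `parse_version`, `build_supports_gpu`, `describe_environment`) are hypothetical and only cover the two capabilities listed.

```python
# Minimum CUDA toolkit version that can target a given compute capability.
# Only two entries are listed for illustration; this is not a complete table.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta, e.g. Titan V
    (7, 5): (10, 0),  # Turing, e.g. Titan RTX
}


def parse_version(text):
    """Turn a version string such as '9.0' or '9.0.176' into (major, minor)."""
    parts = text.split(".")
    return int(parts[0]), int(parts[1])


def build_supports_gpu(cuda_version, capability):
    """Return True if a build against cuda_version can target capability."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        raise KeyError("capability %r is not in the table" % (capability,))
    return parse_version(cuda_version) >= required


def describe_environment():
    """Print what the installed PyTorch reports, if PyTorch is importable."""
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed in this interpreter")
        return
    print("torch:", torch.__version__)
    print("built with CUDA:", torch.version.cuda)
    print("cuDNN:", torch.backends.cudnn.version())
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))
        print("capability:", torch.cuda.get_device_capability(0))


if __name__ == "__main__":
    describe_environment()
    # A CUDA 9.0 build checked against a compute capability 7.5 device.
    print(build_supports_gpu("9.0", (7, 5)))
```

If the check returns False for the installed combination, the environments listed in this report would all be affected in the same way, since each of them uses CUDA 9.0 or older.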

  2. ---environment---
    ubuntu 18.04
    CUDA 9.0
    CuDNN 7.1
    torch 0.3.1 / 0.4.1
    ==>
    error message :
    RuntimeError: cuda runtime error (11) : invalid argument at /pytorch/aten/src/THC/THCGeneral.cpp:663

  3. ---environment---
    ubuntu 18.04
    CUDA 9.0
    CuDNN 7.1
    torch 0.3.1 / 0.4.1

In addition, I edited train_cls.py to set

torch.backends.cudnn.benchmark = False

==>
Traceback (most recent call last):
  File "train_cls.py", line 217, in <module>
    main()
  File "train_cls.py", line 125, in main
    train(train_dataloader, test_dataloader, model, criterion, optimizer, lr_scheduler, bnm_scheduler, args, num_batch)
  File "train_cls.py", line 167, in train
    pred = model(points)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/SSD1/dogyoon/Relation-Shape-CNN-master/models/rscnn_ssn_cls.py", line 102, in forward
    return self.FC_layer(features.squeeze(-1))
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/modules/batchnorm.py", line 66, in forward
    exponential_average_factor, self.eps)
  File "/home/mvpserverone/.conda/envs/rscnn/lib/python3.5/site-packages/torch/nn/functional.py", line 1251, in batch_norm
    raise ValueError('Expected more than 1 value per channel when training, got input size {}'.format(size))
ValueError: Expected more than 1 value per channel when training, got input size [1, 512]
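
The `ValueError` above is what batch normalization raises in training mode when a batch holds a single sample, here an input of size [1, 512]. One possible cause, not confirmed by this report, is a trailing batch of size 1 when the dataset size is not a multiple of the batch size. The sketch below only checks that arithmetic; `trailing_batch_size` and `has_singleton_batch` are hypothetical helpers, and the numbers in the usage lines are illustrative, not taken from this run.

```python
def trailing_batch_size(num_samples, batch_size):
    """Size of the final batch when the incomplete batch is not dropped."""
    if num_samples <= 0 or batch_size <= 0:
        raise ValueError("num_samples and batch_size must be positive")
    remainder = num_samples % batch_size
    # A remainder of zero means the last batch is a full one.
    return remainder if remainder else batch_size


def has_singleton_batch(num_samples, batch_size):
    """True if the final batch would contain exactly one sample."""
    return trailing_batch_size(num_samples, batch_size) == 1


if __name__ == "__main__":
    # Illustrative sizes only.
    print(has_singleton_batch(33, 32))
    print(has_singleton_batch(64, 32))
```

If the check is true for the real dataset and batch size, passing `drop_last=True` to `torch.utils.data.DataLoader` discards the incomplete final batch, and changing the batch size is another option.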


I really hope to find a solution to this problem as soon as possible.
Thank you very much.
