RuntimeError when training on multiple GPUs #1

@Walid-Ked

Description

I'm trying to train the model from scratch on a custom subset of ImageNet. Training works fine on a single GPU, but when running on multiple GPUs I get the following error:

Expected tensor for argument #1 'input' to have the same device as tensor for argument #2 'weight'; but device 3 does not equal 0 (while checking arguments for cudnn_batch_norm)
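This error means that, inside one of the DataParallel replicas, a BatchNorm layer's weights live on cuda:3 while the activation fed into it lives on cuda:0. A quick way to narrow this down (not from the original report; `report_devices` is a hypothetical helper) is to dump the device of every parameter and registered buffer, which usually exposes any state that wasn't moved or replicated with the rest of the model:

```python
import torch
import torch.nn as nn

def report_devices(model):
    """Group every parameter and buffer by device, so a stray
    cuda:0 tensor inside a cuda:3 replica stands out."""
    devices = {}
    for name, p in model.named_parameters():
        devices.setdefault(str(p.device), []).append(name)
    for name, b in model.named_buffers():
        devices.setdefault(str(b.device), []).append(name)
    for dev, names in devices.items():
        print(dev, "->", len(names), "tensors, e.g.", names[0])
    return devices

# Toy usage; with a real model, call this just before the failing forward pass.
m = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8))
devs = report_devices(m)
```

Note this only covers tensors registered as parameters or buffers; a plain tensor attribute (e.g. `self.mask = torch.zeros(...)`) is invisible to it and is a common culprit, since `.cuda()`/`DataParallel` won't move it.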

My configuration file looks like this:

name: train_colorformer
model_type: LABGANRGBModel
scale: 1
num_gpu: 4
manual_seed: 0
queue_size: 64

and I'm using CUDA_VISIBLE_DEVICES to select which GPUs are used.
I tried looking for any inputs that are not moved to CUDA, but without success.
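The model itself isn't shown here, so this is only a sketch of the most common cause of this exact message under nn.DataParallel: a tensor created inside forward() with a hardcoded device (e.g. `.cuda()`, which defaults to cuda:0), so the replica running on cuda:3 mixes a cuda:0 activation with its own cuda:3 BatchNorm weights. The `Head` module below is hypothetical:

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    """Hypothetical module illustrating the usual cause of the error."""
    def __init__(self):
        super().__init__()
        self.bn = nn.BatchNorm2d(8)

    def forward(self, x):
        # BUGGY pattern: noise = torch.randn(x.shape).cuda()
        #   -> always lands on cuda:0; the replica on cuda:3 then feeds a
        #      cuda:0 tensor to its cuda:3 BatchNorm ("device 3 does not equal 0").
        noise = torch.randn_like(x)  # FIX: inherits x's device and dtype
        return self.bn(x + noise)

model = Head()
if torch.cuda.is_available():
    model = model.cuda()
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model)  # replicates the module per device

x = torch.randn(4, 8, 16, 16)
if torch.cuda.is_available():
    x = x.cuda()  # DataParallel scatters the batch across GPUs itself
out = model(x)
print(out.shape)  # torch.Size([4, 8, 16, 16])
```

If the layers themselves are fine, the same symptom can come from tensors kept as plain attributes (set in `__init__` but not registered via `register_buffer`), since DataParallel only replicates parameters and buffers across devices.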
