
torch.nn.DataParallel with OperatorModule wrapping 'astra_cuda'-RayTransform malfunctioning #1545

Open
@jleuschn

Description


There seem to be problems when a RayTransform operator using the 'astra_cuda' backend is wrapped in odl.contrib.torch.operator.OperatorModule and the resulting model is then distributed over multiple GPUs with torch.nn.DataParallel.
I don't have a specific error message at the moment, but it has led to kernel panics on several different servers.
My guess is that the issue is related to the copying performed by DataParallel, which I imagine could cause problems such as conflicting shared memory usage or double freeing.
Does someone have more insight into why this happens, or how to make it work?
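
For reference, a rough sketch of the kind of setup I mean (the geometry, image size, and batch size here are just illustrative, not the exact configuration that crashed):

```python
import torch
import odl
from odl.contrib.torch import OperatorModule

# Parallel-beam ray transform on the ASTRA CUDA backend.
space = odl.uniform_discr([-64, -64], [64, 64], [128, 128], dtype='float32')
geometry = odl.tomo.parallel_beam_geometry(space, num_angles=180)
ray_trafo = odl.tomo.RayTransform(space, geometry, impl='astra_cuda')

# Wrap the operator as a torch module and embed it in a model.
model = torch.nn.Sequential(
    OperatorModule(ray_trafo),
)

# Replicating the module onto multiple GPUs is where the failures occur.
model = torch.nn.DataParallel(model).cuda()

x = torch.rand(4, 1, 128, 128, device='cuda')
y = model(x)  # reportedly crashes / panics when run on several devices
```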
