Multi-GPU setup with PyTorch #1691
Dear all, I am currently looking for a way to utilize multiple GPUs by distributing workloads between them. I found this discussion: mitsuba-renderer/drjit#359. Let's say Mitsuba takes one GPU, and the model (which we don't train) takes the two other GPUs. I would then need to transfer the gradients from those two GPUs to the one running Mitsuba. What would be the best way, on a cluster with four GPUs, to use Mitsuba in an optimization pipeline together with a larger model that does not fit on the same GPU? Thanks a lot!

I have come up with the following code using multiprocessing and IPC. Unfortunately, it fails with:

```
RuntimeError: drjit.backward_from(): the argument does not depend on the input variable(s) being differentiated.
Raising an exception since this is usually indicative of a bug (for example, you may have forgotten to call dr.enable_grad(..)).
If this is expected behavior, provide the drjit.ADFlag.AllowNoGrad flag to the function (e.g., by specifying flags=dr.ADFlag.Default | dr.ADFlag.AllowNoGrad).
```

What am I doing wrong? This is the example:

```python
import multiprocessing as mp

def mitsuba_worker(render_request, render_response, grad_queue, gt_queue):
    # Pin this process to GPU 0 before importing Mitsuba/Dr.Jit
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"

    import mitsuba as mi
    import drjit as dr

    mi.set_variant("cuda_ad_rgb")

    # Build scene
    scene = mi.load_dict(mi.cornell_box())
    params = mi.traverse(scene)

    # Render the ground truth once and hand it to the PyTorch process
    gt = mi.render(scene)
    gt_queue.put(gt.numpy())

    optim = mi.ad.Adam(lr=0.1)
    col = mi.Color3f(0.205421, 0.47798, 0.176425)
    dr.enable_grad(col)
    optim["green.reflectance.value"] = col
    params.update(optim)

    for step in range(10):
        msg = render_request.get()
        if msg == "render":
            image = mi.render(scene, spp=8)
            render_response.put(image.numpy())  # Send rendered image

            # Receive ∂L/∂image from PyTorch
            grad_image = grad_queue.get()
            dr.set_grad(image, grad_image)
            dr.backward(image)  # dr.backward_from() doesn't change anything either

            optim.step()
            optim["green.reflectance.value"] = dr.clip(
                optim["green.reflectance.value"], 0.0, 1.0
            )
            params.update(optim)
            print(f"[Mitsuba] Step {step}")


def pytorch_worker(render_request, render_response, grad_queue, gt_queue):
    # Pin this process to GPU 1 before importing PyTorch
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "1"

    import torch
    import torch.nn.functional as F

    target = torch.tensor(gt_queue.get())

    for step in range(10):
        render_request.put("render")
        image_np = render_response.get()
        image = torch.tensor(image_np, requires_grad=True)

        loss = F.mse_loss(image, target)
        loss.backward()

        # Send ∂L/∂image to Mitsuba
        grad = image.grad
        grad_queue.put(grad)
        print(f"[PyTorch] Step {step}, Loss = {loss.item():.6f}")


if __name__ == "__main__":
    mp.set_start_method("spawn")

    gt_queue = mp.Queue()
    render_request = mp.Queue()
    render_response = mp.Queue()
    grad_queue = mp.Queue()

    p_mi = mp.Process(
        target=mitsuba_worker,
        args=(render_request, render_response, grad_queue, gt_queue),
    )
    p_torch = mp.Process(
        target=pytorch_worker,
        args=(render_request, render_response, grad_queue, gt_queue),
    )

    p_mi.start()
    p_torch.start()
    p_mi.join()
    p_torch.join()
```
Replies: 1 comment 3 replies
I think your issue is in the …

Back to your initial question: we don't plan to support this type of multi-GPU workload. We don't really have a need for it ourselves, and we don't have the infrastructure to actually test it, so I don't really have any good guidelines or tips to share on this topic.
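For what it's worth, the single-GPU pattern from the differentiable-rendering tutorials attaches the rendered image to the scene parameters by passing `params` to `mi.render()`; an image rendered without it does not depend on the variables being differentiated, which matches the error above. Below is a minimal sketch of that pattern. The `grad_np` placeholder (standing in for the ∂L/∂image array shipped over from PyTorch) and the dot-product reduction used to seed the backward pass are illustrative assumptions, not code from this thread:

```python
import numpy as np
import drjit as dr
import mitsuba as mi

mi.set_variant("cuda_ad_rgb")

scene = mi.load_dict(mi.cornell_box())
params = mi.traverse(scene)

opt = mi.ad.Adam(lr=0.1)
opt["green.reflectance.value"] = mi.Color3f(0.205421, 0.47798, 0.176425)
params.update(opt)

# Passing `params` is what connects the rendered image to the
# differentiable parameters in the AD graph.
image = mi.render(scene, params, spp=8)

# `grad_np` stands in for the dL/dimage array received from PyTorch.
grad_np = np.ones(image.shape, dtype=np.float32)
grad = mi.TensorXf(grad_np)

# Seeding the backward pass through a dot product: the derivative of
# sum(image * grad) w.r.t. the parameters is the vector-Jacobian
# product J^T grad, i.e. exactly the gradient the PyTorch loss implies.
dr.backward(dr.sum(image.array * grad.array))

# The parameter gradient is now available for an optimizer step.
print(dr.grad(opt["green.reflectance.value"]))
```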
Hi @njroussel,
I did that as well and have been playing with it for some time, and eventually got something working this morning. Do you think the "right way" to do this is using Dr.Jit?
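For future readers: the single-process route, following the pattern from the official Mitsuba/PyTorch interoperability tutorial, wraps the rendering step with `@dr.wrap_ad(source='torch', target='drjit')` (renamed to `dr.wrap` in Dr.Jit 1.x), so PyTorch's autograd can drive the renderer directly and no queues or IPC are needed when everything fits on one GPU. A rough sketch; the color values, learning rate, and loop below are illustrative, not from this thread:

```python
import drjit as dr
import mitsuba as mi
import torch
import torch.nn.functional as F

mi.set_variant("cuda_ad_rgb")

scene = mi.load_dict(mi.cornell_box())
params = mi.traverse(scene)
key = "green.reflectance.value"

# PyTorch sees this as an ordinary differentiable function; Dr.Jit
# differentiates the rendering step internally and hands the gradient
# back to torch.autograd.
@dr.wrap_ad(source="torch", target="drjit")
def render(value, spp=8):
    params[key] = dr.unravel(mi.Color3f, dr.ravel(value))
    params.update()
    return mi.render(scene, params, spp=spp)

# Illustrative target and starting point
target = render(torch.tensor([0.1, 0.7, 0.1], device="cuda")).detach()
value = torch.tensor([0.2, 0.5, 0.2], device="cuda", requires_grad=True)

opt = torch.optim.Adam([value], lr=0.05)
for step in range(10):
    opt.zero_grad()
    loss = F.mse_loss(render(value), target)
    loss.backward()  # gradient flows through the wrapped render call
    opt.step()
    print(f"Step {step}, Loss = {loss.item():.6f}")
```

Since both frameworks share one process here, the gradient handoff happens in memory rather than through pickled queues, which avoids the cross-process plumbing from the example above.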