DLPack integration not compatible with CUDA Graphs capture due to automatic synchronization #1244

phil-holiday · 2025-12-10T04:46:22Z

phil-holiday
Dec 10, 2025

Problem description

When importing a DLPack-compatible object from Python into nb::ndarray, the implementation directly calls the __dlpack__(stream=None) method without an explicit stream argument. This leads to the default stream being used in the array producer, and results in automatic synchronization happening in that stream. This is problematic when recording kernels inside a CUDA Graphs capture, since this is not allowed. You can prevent this by passing stream=-1, which will turn off any synchronization (the semantics of the stream parameter is defined in the dlpack docs)

I’ve stumbled into this problem while trying to use CUDA C++ functions bound with nanobind during a CUDA Graph capture. More specifically, I’m using the Warp framework to write GPU code, and was in the process of converting some of the Warp kernels into CUDA C++. However I expect similar issues regardless of the framework used, since I’ve also tried out PyTorch with their CUDA Graph API and confirmed the same issue exists.

As to how to solve this problem, I think the simplest way would be to add a NoSync boolean template parameter to nb::ndarray , so that if the option is turned on then thestream=-1 parameter will be applied on the specific array. You could also instead add a field to the nb::ndarray class at runtime, but then this won’t work well with automatic bindings.

Reproducible example code

test_warp.py:

import warp as wp
wp.init()

import cuda_ext

data = wp.zeros(1024, dtype=wp.float32, device="cuda:0")
stream = wp.Stream(device="cuda:0")

wp.capture_begin(device="cuda:0", stream=stream)
cuda_ext.launch_kernel(data, stream.cuda_stream)
graph = wp.capture_end(device="cuda:0", stream=stream)

wp.capture_launch(graph, stream=stream)

test_pytorch.py:

import torch
import cuda_ext

data = torch.zeros(1024, dtype=torch.float32, device="cuda")
stream = torch.cuda.Stream()

# Warmup
with torch.cuda.stream(stream):
    cuda_ext.launch_kernel(data, stream.cuda_stream)
stream.synchronize()

# Capture
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph, stream=stream):
    cuda_ext.launch_kernel(data, stream.cuda_stream)

graph.replay()

cuda_ext.cu:

#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>
#include <cuda_runtime.h>

namespace nb = nanobind;

__global__ void kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

void launch_kernel(nb::ndarray<float, nb::device::cuda> data, uintptr_t stream) {
    int n = data.size();
    kernel<<<(n + 255) / 256, 256, 0, (cudaStream_t)stream>>>(data.data(), n);
}

NB_MODULE(cuda_ext, m) {
    m.def("launch_kernel", &launch_kernel);
}

wjakob · 2025-12-10T10:59:40Z

wjakob
Dec 10, 2025
Maintainer

Thank you for reporting the issue. I don't think that the proposed API is flexible enough. In general, the user might need to specify a custom stream.

It is probably awkward to fuse such functionality into the nb::ndarray API. But perhaps nanobind could expose a lower-level API that would enable such customization.

cc @hpkfft

0 replies

hpkfft · 2025-12-11T05:36:09Z

hpkfft
Dec 11, 2025

I would say this issue is better classified as an enhancement request rather than as a bug.

I agree it would probably be better to create a new API rather than to fuse stream functionality into the existing nb::ndarray API. It would take some time and effort to get the design right. I think it would be worth trying to understand the current (and planned) support by various frameworks (e.g., PyTorch as mentioned above) and by GPU devices (CUDA, ROCM, OneAPI, etc.).

Note that DLPack has recently added functions for a C exchange API:
https://github.com/dmlc/dlpack/blob/93c8f2a3c774b84af6f652b1992c48164fae60fc/include/dlpack/dlpack.h#L411

Finally, note that nanobind can create an ndarray from a capsule (i.e., from the result returned by __dlpack__()), in which case nanobind does not itself call __dlpack__().

One idea is to do this in your Python code. Instead of cuda_ext.launch_kernel(data, stream.cuda_stream), use:

cuda_ext.launch_kernel(data.__dlpack__(max_version=(1,0), stream=-1), stream.cuda_stream)

This can also be done in C++ using nanobind. I did not test this (nor what I wrote above), but something like:

void launch_kernel(nb::object obj, uintptr_t stream) {
    nb::tuple max = nb::make_tuple(1, 0);
    nb::object capsule = obj.attr("__dlpack__")(nb::arg("max_version") = max,
                                                nb::arg("stream") = -1);
    nb::ndarray<float, nb::device::cuda> data
            = nb::cast<nb::ndarray<float, nb::device::cuda>>(capsule);
    int n = data.size();
    kernel<<<(n + 255) / 256, 256, 0, (cudaStream_t)stream>>>(data.data(), n);
}

0 replies

wjakob · 2025-12-11T11:18:23Z

wjakob
Dec 11, 2025
Maintainer

Thank you for the thorough response @hpkfft. I will move this to the discussion tab since it is not a bug in existing functionality.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

DLPack integration not compatible with CUDA Graphs capture due to automatic synchronization #1244

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{editor}}'s edit

{{editor}}'s edit

Uh oh!

Replies: 3 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

DLPack integration not compatible with CUDA Graphs capture due to automatic synchronization #1244

Uh oh!

Uh oh!

phil-holiday Dec 10, 2025

Problem description

Reproducible example code

Replies: 3 comments

Uh oh!

wjakob Dec 10, 2025 Maintainer

Uh oh!

hpkfft Dec 11, 2025

Uh oh!

wjakob Dec 11, 2025 Maintainer

phil-holiday
Dec 10, 2025

wjakob
Dec 10, 2025
Maintainer

hpkfft
Dec 11, 2025

wjakob
Dec 11, 2025
Maintainer