DLPack integration not compatible with CUDA Graphs capture due to automatic synchronization #1244
Replies: 3 comments
Thank you for reporting the issue. I don't think that the proposed API is flexible enough. In general, the user might need to specify a custom stream. It is probably awkward to fuse such functionality into the existing `nb::ndarray` interface. cc @hpkfft
I would say this issue is better classified as an enhancement request rather than as a bug. I agree it would probably be better to create a new API rather than to fuse stream functionality into the existing `nb::ndarray` class.

Note that DLPack has recently added functions for a C exchange API.

Finally, note that nanobind can create an ndarray from a capsule (i.e., from the result returned by `__dlpack__()`).

One idea is to do this in your Python code: instead of passing the array object itself, request the capsule with `stream=-1` and pass that (first sketch below). This can also be done in C++ using nanobind. I did not test this (nor what I wrote above), but it would look something like the second sketch below.
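A minimal sketch of the Python-side idea, assuming a nanobind extension named `cuda_ext` exposing a function `my_func` (both names hypothetical), and assuming the producer implements the DLPack convention that `stream=-1` means "do not synchronize":

```python
import torch
import cuda_ext  # hypothetical nanobind extension

t = torch.arange(1024, device="cuda", dtype=torch.float32)

# Request the DLPack capsule ourselves with stream=-1 so that the producer
# performs no synchronization, then pass the capsule instead of the tensor;
# nanobind's ndarray caster accepts capsules as well.
capsule = t.__dlpack__(stream=-1)
cuda_ext.my_func(capsule)
```

And an untested sketch of the C++ variant, which fetches the capsule before handing it to the `nb::ndarray` caster:

```cpp
#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;

void my_func(nb::object obj) {
    // stream=-1: per the DLPack protocol, the producer must not synchronize
    // (safe during CUDA Graph capture).
    nb::object capsule = obj.attr("__dlpack__")(nb::arg("stream") = -1);

    // nanobind can construct an ndarray directly from the capsule.
    auto array =
        nb::cast<nb::ndarray<float, nb::ndim<1>, nb::device::cuda>>(capsule);

    // ... launch kernels using array.data() on the capturing stream ...
}
```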
Thank you for the thorough response @hpkfft. I will move this to the discussion tab since it is not a bug in existing functionality. |
Problem description
When importing a DLPack-compatible object from Python into `nb::ndarray`, the implementation directly calls the `__dlpack__(stream=None)` method without an explicit stream argument. As a result, the producer uses the default stream and synchronizes on it automatically. This is problematic when recording kernels inside a CUDA Graphs capture, since synchronization is not allowed there. Passing `stream=-1` prevents this by turning off synchronization entirely (the semantics of the stream parameter are defined in the DLPack docs).

I stumbled into this problem while trying to use CUDA C++ functions bound with nanobind during a CUDA Graph capture. More specifically, I am using the Warp framework to write GPU code and was in the process of converting some of the Warp kernels into CUDA C++. However, I expect similar issues regardless of the framework used, since I also tried PyTorch with its CUDA Graph API and confirmed that the same issue exists.
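Conceptually, the difference is a single argument (illustrative only, not nanobind's actual internals):

```python
# What the current import amounts to on the producer side:
capsule = tensor.__dlpack__()           # stream=None -> default-stream sync
# What a capture-safe import would request instead:
capsule = tensor.__dlpack__(stream=-1)  # no synchronization (DLPack spec)
```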
As to how to solve this problem, I think the simplest way would be to add a `NoSync` boolean template parameter to `nb::ndarray`, so that if the option is turned on, the `stream=-1` parameter is applied to that specific array (sketch below). You could also instead add a field to the `nb::ndarray` class at runtime, but then this would not work well with automatic bindings.
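To illustrate, a binding site with the proposed parameter might look as follows; the `NoSync` annotation is hypothetical and does not exist in nanobind today:

```cpp
// Hypothetical: a NoSync annotation would make nanobind's caster call
// __dlpack__(stream=-1) instead of __dlpack__() when importing this array.
void my_func(nb::ndarray<float, nb::device::cuda, NoSync> arr);
```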
Reproducible example code

`test_warp.py`:
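The original listing was not preserved here; a rough reconstruction of the Warp reproduction, assuming the hypothetical `cuda_ext.my_func` from above:

```python
import warp as wp
import cuda_ext  # hypothetical nanobind extension

wp.init()
x = wp.zeros(1024, dtype=float, device="cuda:0")

wp.capture_begin()
try:
    # Fails: nanobind imports x via __dlpack__(stream=None), and the
    # resulting synchronization is illegal during graph capture.
    cuda_ext.my_func(x)
finally:
    graph = wp.capture_end()

wp.capture_launch(graph)
```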
`test_pytorch.py`:
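Likewise, a reconstruction of the PyTorch reproduction under the same assumptions:

```python
import torch
import cuda_ext  # hypothetical nanobind extension

x = torch.zeros(1024, device="cuda", dtype=torch.float32)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Same failure: the DLPack import synchronizes during capture.
    cuda_ext.my_func(x)

g.replay()
```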
`cuda_ext.cu`:
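And a guess at the shape of the extension itself (again, the original listing is missing; `scale` and `my_func` are placeholder names):

```cpp
#include <nanobind/nanobind.h>
#include <nanobind/ndarray.h>

namespace nb = nanobind;

__global__ void scale(float *data, size_t n, float s) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= s;
}

// nanobind converts the incoming Python object to nb::ndarray by calling
// its __dlpack__() method -- currently without a stream argument, which is
// where the unwanted synchronization happens.
void my_func(nb::ndarray<float, nb::ndim<1>, nb::device::cuda> arr) {
    size_t n = arr.shape(0);
    scale<<<(unsigned int) ((n + 255) / 256), 256>>>(arr.data(), n, 2.0f);
}

NB_MODULE(cuda_ext, m) {
    m.def("my_func", &my_func);
}
```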