Add PyTorch autograd tests and fix interop/copy-back bugs#781
jhelferty-nv wants to merge 34 commits into shader-slang:main from
Conversation
…ion increments

Addresses a test failure found in shader-slang#733.

SlangPy was copying data back to ALL PyTorch tensors after kernel execution, including read-only inputs. This copy used PyTorch's copy_() method, which increments the tensor's _version counter, breaking autograd's version tracking and causing a RuntimeError during the backward pass for interleaved SlangPy/PyTorch operations.

The fix checks the binding's access type before copying back: only tensors with write or readwrite access get data copied back, not read-only inputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
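For readers unfamiliar with the failure mode, here is a minimal, self-contained PyTorch sketch (no SlangPy involved, illustrative tensors and ops only) of how an unnecessary `copy_()` into a tensor that autograd has saved produces exactly this kind of error:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.tanh(x)            # autograd saves y (at _version 0) for the backward pass
loss = (y * y).sum()

# Emulate a spurious copy-back into a tensor autograd is tracking.
with torch.no_grad():
    y.copy_(y.clone())       # in-place copy_() bumps y._version to 1

try:
    loss.backward()
except RuntimeError as e:
    # "one of the variables needed for gradient computation has been
    #  modified by an inplace operation ..."
    print(e)
```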
ccummingsNV
left a comment
Good set of tests, though obviously they need to pass before we can merge.
We should add tests that pass full tensors in as well as tests that just take scalar variables, as the interop logic is subtly different.
It would also be nice to see if we can get rid of the need to wrap slangpy functions in torch modules for them to work. Our goal is for a user to integrate seamlessly - creating custom wrapper classes isn't ideal.
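For context, a rough sketch of the wrapper pattern in question versus the intended usage; the module and function names below are placeholders, not the actual test code:

```python
import torch
import torch.nn as nn

class SlangReluWrapper(nn.Module):
    """Wrapper class of the kind we'd like to make unnecessary."""

    def __init__(self, slang_module):
        super().__init__()
        self.slang_module = slang_module   # hypothetical loaded slangpy module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.slang_module.relu(x)   # hypothetical slangpy function

# Goal: call the slangpy function inline inside an autograd graph,
# e.g. y = slang_module.relu(x), with gradients flowing without a wrapper.
```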
    AccessType primal_access = binding->access().first;
    bool primal_is_writable = (primal_access == AccessType::write || primal_access == AccessType::readwrite);
    bool needs_primal_copyback = m_writable && primal_is_writable && primal_info.numel > 0 && primal_interop_buffer;
    bool needs_grad_copyback = has_grad && grad_info.numel > 0 && grad_interop_buffer;
I'm not convinced this logic is correct. It works for a scalar during broadcasting (i.e. passing a tensor to a float), but what if it's passing a tensor to a tensor? The access type would probably be 'read' (it's just reading the tensor info) even though the tensor is written to.
Confirmed, I'll push a fix and tests for this.
I'm not happy with this - it's getting too complex, and will slow down the dispatch. The simplest is just to check if it is writable, though as you point out that could result in redundant copying.
Probably the correct logic, if we want to do it, would be to adjust the logic that caches offsets to look at the slang uniform type that is being bound to, and identify whether it is a writable tensor type (by name). This could work for gradient-based tensors too, i.e. in that situation we're saying:
- if the user has explicitly written a function that takes a writable tensor, assume it is being written to
- otherwise, if the user has written a scalar function we're vectorizing across, and it has an 'out/inout' param, assume it is being written to
Exactly the same principle could be applied to the gradients, making that logic more robust/optimal as well.
As that code only happens once, it would not slow down dispatches. (though we'd want to rename some bits so it didn't claim to just be caching offsets when it in fact was caching more than that!).
Ultimately the actual decision making for this is in tensorcommon.py, where we generate the call data for the tensor. This is where we decide what actual tensor type the generated kernel uses. Another option would be to write some boolean value 'needs copying' inside gen_calldata, but that feels wrong to me - gen_calldata by implication is immutable, and many design decisions are based on the idea that a marshall is complete from the moment it's created.
We probably also need to apply this logic to the normal tensor right?
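A rough Python rendering of the cache-time rule being proposed here; the real decision lives in the C++ marshall and tensorcommon.py, and the function and parameter names below are illustrative only:

```python
def needs_primal_copyback(slang_uniform_type_name: str, scalar_param_is_out: bool) -> bool:
    """Decide copy-back once, at offset-caching time, from the bound Slang type."""
    # Explicitly writable tensor types are assumed to be written to.
    if slang_uniform_type_name.startswith(("RWTensor", "WTensor")):
        return True
    # Scalar function vectorized across a tensor: copy back only for out/inout params.
    return scalar_param_is_out
```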
Okay, I've gone ahead and moved the logic to cache time. We do need this fixed, because more complex cases where a Sequential jumps back and forth between SlangPy and PyTorch will raise an error if we end up copying a read-only input tensor, e.g.:
RuntimeError: one of the variables needed for gradient computation has been modified
by an inplace operation: [torch.cuda.FloatTensor [32, 16]], which is output 0 of
TanhBackward0, is at version 2; expected version 0 instead.
The cache-time logic uses TypeReflection::Kind instead of the type name ("RWTensor" or "WTensor"). Hopefully this aligns with what you were thinking?
As for normal tensors, IIUC, the normal SlangPy Tensor doesn't need copy-back because the shader writes directly to the tensor's GPU buffer (via device_address()). The interop buffer / copy-back mechanism is specifically for PyTorch tensors on non-CUDA backends, where SlangPy can't write directly to PyTorch's CUDA memory from a Vulkan/D3D12 shader.
Or as Claude puts it, the asymmetry is:
- SlangPy buffer on Vulkan/D3D12: The buffer is native to Vulkan/D3D12. The shader writes directly to it. cuda_memory() provides a CUDA view of this same memory for interop (if needed).
- PyTorch tensor on Vulkan/D3D12: The tensor is native to CUDA (PyTorch created it). Vulkan/D3D12 shaders can't write to CUDA-owned memory. So we create a SlangPy interop buffer (which Vulkan/D3D12 can access), run the shader, then copy the results back to PyTorch via CUDA.
Tests for copyback logic in PyTorch integration.
…eters

This fixes an issue with commit 6edd1b6, which used AccessType to prevent spurious copy-back. That approach was incomplete:
- AccessType describes the *parameter binding* (in/out/inout modifiers)
- It does NOT describe whether the underlying *tensor data* is writable

The original fix worked for scalar broadcast (tensor → float) but broke tensor parameters (tensor → RWTensor<T,N>) because:
- RWTensor parameter: binding is "in RWTensor<...>" → AccessType::read
- But the tensor DATA is writable and must be copied back

The new approach checks the *target type* instead:
- Simple types (scalar/vector/matrix): broadcast case, tensor is read-only → skip copy-back (prevents autograd _version increment)
- Complex types (struct/resource): may include tensor types → use m_writable to determine copy-back (preserves output data)

This correctly handles both cases:
1. Tensor → float: no copy-back, autograd works
2. Tensor → RWTensor: copy-back occurs, output data preserved

Adds tests for both cases to prevent regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous fix (b590ae3) incorrectly skipped copy-back for ALL tensors bound to simple types (scalar/vector/matrix), including outputs.

For scalar functions like `float slang_relu(float x)`:
- INPUT tensor → float: read-only broadcast, skip copy-back (correct)
- OUTPUT tensor → float: contains results, MUST copy back (was broken)

The fix now checks BOTH conditions to skip copy-back:
1. Target type is simple (scalar/vector/matrix)
2. Access type is read-only (not write or readwrite)

This ensures outputs (which have write access) are always copied back, while inputs (read access) skip the copy-back that would break autograd.

Co-authored-by: Cursor <cursoragent@cursor.com>
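In other words, the combined condition looks roughly like the following sketch (illustrative Python, not the shipped C++):

```python
def skip_primal_copyback(target_kind: str, access: str) -> bool:
    """Skip copy-back only for read-only tensors broadcast to simple types."""
    is_simple = target_kind in ("scalar", "vector", "matrix")
    is_read_only = access == "read"
    return is_simple and is_read_only
```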
- Remove numbered test case labels (Case 1, 2a, 2b)
- Remove references to specific reviewers in comments
- Improve section header clarity

Co-authored-by: Cursor <cursoragent@cursor.com>
ccummingsNV
left a comment
Hi. This is great work - sorry to be picky on the binding - it's just important we absolutely nail that particular logic, both in terms of correctness and performance.
I've detailed what I think is the only robust path for us now - ultimately looking at the type that is being bound to, which we conveniently do have access to when calculating offsets. I think the actual Slang-side type that we're binding the tensor to (note: not the same as the argument type of the user's function) is the concrete answer to when/whether copies are needed.
Add tests to verify gradient interop buffer copy-back works correctly:
- test_gradient_copyback_single_op: single input gradient verification
- test_gradient_copyback_multiple_inputs: multiple inputs with gradient values

These tests will serve as regression tests for the upcoming refactoring that moves copy-back decisions from dispatch time to cache time.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the torch tensor copy-back decision from runtime dispatch to the offset caching phase. This avoids expensive runtime type reflection (vector_type()->type_reflection()->kind()) on every dispatch.

Changes:
- Add needs_primal_copyback and needs_grad_copyback fields to CachedOffsets
- Compute these flags in ensure_offsets_cached() using binding type/access
- Simplify write_shader_cursor_with_interop() to use pre-computed flags
- Apply the same optimization to gradient copy-back (now checks AccessType)

The copy-back logic remains the same:
- Read-only simple types (scalar/vector/matrix): no copy-back
- Writable outputs or tensor types: copy-back as needed
- Gradients: copy-back only when access is write/readwrite

Co-authored-by: Cursor <cursoragent@cursor.com>
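A rough Python rendering of the cached state described in this and the following commit; the real struct is C++, the offsets field below is a placeholder, and the grad flag is dropped again in a later commit:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CachedBindingInfoSketch:
    shader_offsets: Dict[str, int] = field(default_factory=dict)  # placeholder for the cached offsets
    needs_primal_copyback: bool = False
    needs_grad_copyback: bool = False   # later removed in favor of a runtime check
```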
The struct now contains more than just shader offsets - it also holds copy-back decision flags. Rename to better reflect its purpose:
- CachedOffsets → CachedBindingInfo
- ensure_offsets_cached → ensure_binding_info_cached
- m_cached_offsets → m_cached_binding_info
- extract_offsets → extract_binding_info

Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify that copy-back decisions are made at cache time in C++ (ensure_binding_info_cached) based on Slang parameter type and access mode, not just the writable flag. Co-authored-by: Cursor <cursoragent@cursor.com>
Note that these flags are only used by NativeTorchTensorMarshall; NativeTensorMarshall leaves them as default (false). Co-authored-by: Cursor <cursoragent@cursor.com>
Update m_cached_offsets to m_cached_binding_info in new code from main branch. Co-authored-by: Cursor <cursoragent@cursor.com>
The refactoring in 2cd616c moved copy-back decisions to cache time, but this broke gradient copy-back for raw torch.Tensor inputs.

Root cause: needs_grad_copyback was computed using has_derivative(), which returns false for raw torch.Tensor inputs (they don't have d_in/d_out marshalls at construction). However, during the backward pass, gradients ARE present and need to be copied back.

Fix: use the runtime has_grad check instead of the cached needs_grad_copyback for gradient copy-back decisions. This matches the original behavior, where gradient copy-back was determined by whether grad_value was actually present at runtime.

This fixes test_tensor_interfaces and test_tensor_generic failing on Vulkan/D3D12 with all-zero gradient outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
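An illustrative sketch of the runtime decision described above (names approximate the native parameters, not the exact C++ signature):

```python
def should_copy_grad_back(has_grad: bool, grad_numel: int, grad_interop_buffer) -> bool:
    """Gradient copy-back keys off what is actually present at dispatch time."""
    return has_grad and grad_numel > 0 and grad_interop_buffer is not None
```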
This field was never used - gradient copy-back decisions are made at runtime using the has_grad parameter, not cached at bind time. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Handle the backward pass when the primal tensor is None (output slots in autograd backward). Create a zeroed interop buffer instead of crashing, and skip primal copy-back when there is no source tensor.
- Fix broadcast stride zeroing order: apply after make_contiguous_strides so broadcast dimensions correctly use stride 0 for interop buffers. Also apply to the direct CUDA pointer path.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ternal memory

cuExternalMemoryGetMappedBuffer returns a CUdeviceptr that must be freed with cuMemFree before calling cuDestroyExternalMemory. Without this, the CUDA driver keeps the underlying allocation alive, leaking ~64KB+ per interop buffer and eventually hitting VK_ERROR_OUT_OF_DEVICE_MEMORY.

Co-authored-by: Cursor <cursoragent@cursor.com>
Tests exercise realistic ML/optimization workflows using PyTorch autograd, mirroring patterns from the slang-torch examples:
- Polynomial optimization (cubic polynomial fitting with Adam)
- Bezier curve fitting (control point optimization)
- Two-layer MLP optimization (chained linear_transform -> relu4 -> dot4)
- Multi-output optimization (2D vector output)
- Gradient correctness with broadcast parameters
- Multiple backward passes (no state leak)
- Interleaved slangpy + pure PyTorch optimization

Each test validates convergence on both CUDA and Vulkan backends.

Co-authored-by: Cursor <cursoragent@cursor.com>
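The rough shape these workflow tests follow, shown with a pure-PyTorch stand-in for the slangpy kernel; this sketch is illustrative only (the real tests call slangpy functions such as cubic_poly) and assumes a CUDA device is available:

```python
import torch

def polynomial_fit_pattern(device: str = "cuda") -> None:
    coeffs = torch.zeros(4, requires_grad=True, device=device)
    target = torch.tensor([1.0, -2.0, 0.5, 3.0], device=device)
    x = torch.linspace(-1.0, 1.0, 64, device=device)
    optimizer = torch.optim.Adam([coeffs], lr=0.1)

    def poly(c: torch.Tensor) -> torch.Tensor:
        # Stand-in for the slangpy kernel call, e.g. module.cubic_poly(...)
        return c[0] + c[1] * x + c[2] * x**2 + c[3] * x**3

    losses = []
    for _epoch in range(300):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(poly(coeffs), poly(target))
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    assert losses[-1] < 0.1 * losses[0]   # loose convergence check
```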
…ang-torch ports

The original slang-torch examples use DiffTensorView, [CUDAKernel], [AutoPyBindCUDA], and manual torch.autograd.Function — none of which are used here. Reframe the documentation to be explicit that these tests validate slangpy's PyTorch autograd integration for the same categories of workload, not ports of the original code. Reference shader-slang#740 and shader-slang#768 as prerequisites for true parity tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
Better reflects the file's contents: PyTorch autograd integration tests exercising workflow patterns (optimizer loops, gradient accumulation, chained calls), not end-to-end parity tests. Co-authored-by: Cursor <cursoragent@cursor.com>
ccummingsNV
left a comment
Nice catch on the memory leak.
I think the binding is still fragile, and needs to piggyback off what the system already decided rather than attempt to re-calculate writability. The shader cursor already has enough information to say exactly whether the tensor is writable.
We also need to make that memset async if it is to be used in a performant way.
    // counter via copy_(), breaking autograd's version tracking.
    // Writable outputs MUST copy back to return results.
    //
    // For tensor types (Tensor, RWTensor, etc.), copy back if marshall is writable.
Good that it's being cached, but I honestly think the solution here, as I said below, is to check the type of the tensor being bound to, not attempt to infer writability from the vector type. To clarify:
- vector type == type being passed to the user's function
- slang type == the uniform type stored in the generated kernel's call data, which this marshall represents (this is what the cursor is pointing at)
Python side, a lot of logic goes into working out whether the slang type should be Tensor, WTensor, RWTensor, etc. This is the 'ground truth' with regards to whether copying is needed. A read-only tensor uniform does not need copy-back. A write-only tensor uniform technically does not need the initial copy.
The same principle applies to differentiable tensors, though is more complex:
- DiffTensor == read only primal, writable gradient (grad out)
- WDiffTensor == writable primal, readable gradient (grad in)
- RWDiffTensor == rw primal, rw gradient (grad out + grad in)
So by trying to work things out from the vector type here, you're effectively trying to replicate logic that already happened python side. Even if you get it right for now, it'll mean 2 places this logic needs to be maintained.
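A compact restatement of that mapping as data (illustrative Python only; the type names follow the comment above, and the actual decision is made Python-side and in the marshall):

```python
# (copy primal back, copy grad back) per Slang uniform tensor type name
COPYBACK_BY_SLANG_TYPE = {
    "Tensor":       (False, False),  # read-only primal, no gradient
    "WTensor":      (True,  False),  # writable primal
    "RWTensor":     (True,  False),  # read-write primal
    "DiffTensor":   (False, True),   # read-only primal, writable gradient (grad out)
    "WDiffTensor":  (True,  False),  # writable primal, readable gradient (grad in)
    "RWDiffTensor": (True,  True),   # rw primal, rw gradient (grad out + grad in)
}
```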
    });
    void* cuda_ptr = interop_buffer->cuda_memory();
    if (cuda_ptr && buffer_size > 0)
        cuda::memset_device(static_cast<uint8_t*>(cuda_ptr), 0, buffer_size);
This is a non-async call, so it would block the full gfx pipeline on execution. It should use an async memset (e.g. cuMemsetD8Async) and provide the stream this operation is being run on.
    // This handles non-contiguous tensors via PyTorch's copy mechanism
    // copy_to_buffer() now throws on error with detailed message
    if (info.numel > 0 && info.data_ptr != nullptr) {
        TorchBridge::instance().copy_to_buffer(tensor_value, interop_buffer->cuda_memory(), buffer_size);
I know you didn't write this, but just noting - for reasons related to the cuda::memset_device point, this could cause a bug. If the copy_to_buffer I wrote isn't running on the correct CUDA stream, we could end up with the copy occurring after the actual dispatch.
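For reference, a small sketch of how the raw CUDA stream PyTorch is queuing work on can be read from the Python side (standard PyTorch API; how the handle is threaded into the native layer is only indicated in a comment):

```python
import torch

if torch.cuda.is_available():
    stream = torch.cuda.current_stream()
    raw_stream = stream.cuda_stream   # integer handle (cudaStream_t / CUstream)
    # This is the handle the native layer would need so memsets and copies
    # are ordered with the dispatch on the same stream, instead of stream 0.
    print(f"current CUDA stream handle: {raw_stream:#x}")
```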
Resolve conflict in slangpytorchtensor.cpp: integrate TensorView and DiffTensorView support (shader-slang#775) with CachedBindingInfo naming. Fix stale m_cached_offsets reference in slangpytensor.cpp from the auto-merge of the TensorView code. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the vector_type kind + AccessType approach for determining copy-back with the Slang uniform type name from the shader cursor. The Python layer already determines the concrete Slang tensor type (Tensor/WTensor/RWTensor/DiffTensor/WDiffTensor/RWDiffTensor) — this is the ground truth for writability.

Also cache gradient copy-back at bind time using the same approach:
- DiffTensor → read primal, write grad → copy back grad
- WDiffTensor → write primal, read grad → no grad copy-back
- RWDiffTensor → rw primal, rw grad → copy back grad

This avoids maintaining writability logic in two places and makes the copy-back decision robust against future tensor type changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace synchronous cuda::memset_device (cuMemsetD8) with async cuda::memset_device_async (cuMemsetD8Async) when zeroing interop buffers for backward pass output slots. The synchronous version blocks the host until completion, stalling the GPU pipeline. The async version uses the default CUDA stream (stream 0), which is ordered with respect to all other operations and does not block the host. Co-authored-by: Cursor <cursoragent@cursor.com>
📝 Walkthrough

Adds extensive PyTorch gradient-parity and autograd workflow tests for SlangPy, refactors tensor marshalling to cache binding info (including copy-back flags) instead of offsets, exposes a CUDA-stream-aware async memset, and threads the CUDA stream into SlangPy's CallContext.
Sequence Diagram(s)

    sequenceDiagram
        participant Py as PyTorch (Python)
        participant Bind as SlangPy Binding (C++)
        participant Marshall as Tensor Marshall
        participant GPU as GPU / CUDA
        participant Torch as Torch Tensor Memory
        Py->>Bind: call slang kernel (includes CUDA stream)
        Bind->>Marshall: ensure_binding_info_cached(cursor, binding)
        Marshall-->>Bind: return CachedBindingInfo (offsets, writable flags)
        Bind->>GPU: dispatch kernel (using provided stream)
        GPU-->>Bind: kernel finishes (async stream)
        Bind->>Marshall: post-dispatch copy-back decision (needs_primal_copyback / needs_grad_copyback)
        Marshall->>Torch: perform copy-back to torch tensor memory if needed
        Bind-->>Py: return result tensor (views/contiguity preserved)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (2)
slangpy/tests/slangpy_tests/test_torch_autograd_workflows.py (1)
170-184: Consider renaming the unused loop variable epoch to _epoch.

The static analysis correctly identifies that epoch is not used within the loop body. While this doesn't affect functionality, renaming to _epoch would silence the linter and signal intent. This applies to lines 170, 254, 334, 403, and 577.

Example fix (applies to all similar loops):

    - for epoch in range(300):
    + for _epoch in range(300):

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@slangpy/tests/slangpy_tests/test_torch_autograd_workflows.py` around lines 170-184: The loop variable "epoch" is unused in multiple training loops (e.g., the loop that calls optimizer.zero_grad(), module.cubic_poly(...), loss.backward(), optimizer.step(), and assert_loss_decreased) — rename "epoch" to "_epoch" in each of these for-loops (occurrences you noted) to satisfy the linter and communicate intent; update all for epoch in range(...) declarations to for _epoch in range(...) in the test_torch_autograd_workflows.py file where the training loops are defined.

slangpy/tests/slangpy_tests/test_pytorch_gradient_parity.py (1)
444-444: Refactor lambda to function definition.

The linter correctly flags this lambda assignment. Using def is more readable and allows for docstrings if needed.

Proposed fix:

    - pytorch_mse = lambda output, tgt: nn.functional.mse_loss(output, tgt)
    + def pytorch_mse(output: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    +     return nn.functional.mse_loss(output, tgt)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@slangpy/tests/slangpy_tests/test_pytorch_gradient_parity.py` at line 444: The assignment using a lambda named pytorch_mse should be replaced by a normal function definition: find the lambda assignment "pytorch_mse = lambda output, tgt: nn.functional.mse_loss(output, tgt)" and refactor it into a def pytorch_mse(output, tgt): that returns nn.functional.mse_loss(output, tgt) (optionally add a one-line docstring); update any references to pytorch_mse unchanged since the name stays the same.
memset_device_async was using stream 0 (default), which can race with work on the PyTorch CUDA stream. Thread the stream from NativeCallRuntimeOptions through CallContext so interop buffer operations use the same stream as the dispatch. Co-authored-by: Cursor <cursoragent@cursor.com>
Fixes #733, #387
Bugfixes
- Free the mapped `CUdeviceptr` with `cuMemFree` before destroying external memory, preventing OOM after many Vulkan dispatches.
- Handle the backward pass when the primal tensor is `None` (backward output slot) by creating a zeroed interop buffer, instead of crashing.
- Base copy-back decisions on the Slang uniform tensor type name (`W`/`RW` prefix), cached at bind time. Prevents spurious `copy_()` on read-only inputs that break autograd's version tracking.
- Thread the CUDA stream from `NativeCallRuntimeOptions` through `CallContext` so `memset_device_async` and dispatch use the correct stream instead of stream 0.
- Replace `cuMemsetD8` with `cuMemsetD8Async` for zeroed interop buffers.

Refactoring
- `CachedOffsets` → `CachedBindingInfo`; now holds copy-back decision flags alongside shader offsets.
- `CallContext` constructor accepts an optional `NativeHandle cuda_stream`.

New tests