Add PyTorch autograd tests and fix interop/copy-back bugs#781
jhelferty-nv wants to merge 34 commits into shader-slang:main from
Conversation
…ion increments

Addresses a test failure found in shader-slang#733.

SlangPy was copying data back to ALL PyTorch tensors after kernel execution, including read-only inputs. This copy used PyTorch's copy_() method, which increments the tensor's _version counter, breaking autograd's version tracking and causing a RuntimeError during the backward pass for interleaved SlangPy/PyTorch operations.

The fix checks the binding's access type before copying back: only tensors with write or readwrite access get data copied back, not read-only inputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
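For readers unfamiliar with the failure mode, here is a minimal, self-contained PyTorch sketch (no SlangPy involved, illustrative tensors and ops only) of how an unnecessary `copy_()` into a tensor that autograd has saved produces exactly this kind of error:

```python
import torch

x = torch.randn(4, requires_grad=True)
y = torch.tanh(x)            # autograd saves y (at _version 0) for the backward pass
loss = (y * y).sum()

# Emulate a spurious copy-back into a tensor autograd is tracking.
with torch.no_grad():
    y.copy_(y.clone())       # in-place copy_() bumps y._version to 1

try:
    loss.backward()
except RuntimeError as e:
    # "one of the variables needed for gradient computation has been
    #  modified by an inplace operation ..."
    print(e)
```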
ccummingsNV
left a comment
Good set of tests, though obviously they need to pass before we can merge.
We should add tests that pass full tensors in as well as tests that just take scalar variables, as the interop logic is subtly different.
It would also be nice to see if we can get rid of the need to wrap slangpy functions in torch modules for them to work. Our goal is for a user to integrate seamlessly - creating custom wrapper classes isn't ideal.
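For context, a rough sketch of the wrapper pattern in question versus the intended usage; the module and function names below are placeholders, not the actual test code:

```python
import torch
import torch.nn as nn

class SlangReluWrapper(nn.Module):
    """Wrapper class of the kind we'd like to make unnecessary."""

    def __init__(self, slang_module):
        super().__init__()
        self.slang_module = slang_module   # hypothetical loaded slangpy module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.slang_module.relu(x)   # hypothetical slangpy function

# Goal: call the slangpy function inline inside an autograd graph,
# e.g. y = slang_module.relu(x), with gradients flowing without a wrapper.
```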
    AccessType primal_access = binding->access().first;
    bool primal_is_writable = (primal_access == AccessType::write || primal_access == AccessType::readwrite);
    bool needs_primal_copyback = m_writable && primal_is_writable && primal_info.numel > 0 && primal_interop_buffer;
    bool needs_grad_copyback = has_grad && grad_info.numel > 0 && grad_interop_buffer;
I'm not convinced this logic is correct. It works for a scalar during broadcasting (i.e. passing a tensor to a float), but what if it's passing a tensor to a tensor? The access type would probably be 'read' (it's just reading the tensor info) even though the tensor is written to.
Confirmed, I'll push a fix and tests for this.
I'm not happy with this - it's getting too complex, and will slow down the dispatch. The simplest is just to check if it is writable, though as you point out that could result in redundant copying.
Probably the correct logic, if we want to do it, would be to adjust the logic that caches offsets to look at the slang uniform type that is being bound to, and identify whether it is a writable tensor type (by name). This could work for gradient-based tensors too, i.e. in that situation we're saying:
- if the user has explicitly written a function that takes a writable tensor, assume it is being written to
- otherwise, if the user has written a scalar function we're vectorizing across, and it has an 'out/inout' param, assume it is being written to
Exactly the same principle could be applied to the gradients, making that logic more robust/optimal as well.
As that code only happens once, it would not slow down dispatches. (though we'd want to rename some bits so it didn't claim to just be caching offsets when it in fact was caching more than that!).
Ultimately the actual decision making for this is in tensorcommon.py, where we generate the call data for the tensor. This is where we decide what actual tensor type the generated kernel uses. Another option would be to write some boolean value 'needs copying' inside gen_calldata, but that feels wrong to me - gen_calldata by implication is immutable, and many design decisions are based on the idea that a marshall is complete from the moment it's created.
We probably also need to apply this logic to the normal tensor right?
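A rough Python rendering of the cache-time rule being proposed here; the real decision lives in the C++ marshall and tensorcommon.py, and the function and parameter names below are illustrative only:

```python
def needs_primal_copyback(slang_uniform_type_name: str, scalar_param_is_out: bool) -> bool:
    """Decide copy-back once, at offset-caching time, from the bound Slang type."""
    # Explicitly writable tensor types are assumed to be written to.
    if slang_uniform_type_name.startswith(("RWTensor", "WTensor")):
        return True
    # Scalar function vectorized across a tensor: copy back only for out/inout params.
    return scalar_param_is_out
```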
Okay, I've gone ahead and moved the logic to cache time. We do need this fixed, because more complex cases where a Sequential jumps back and forth between SlangPy and PyTorch will raise an error if we end up copying a read-only input tensor, e.g.:
RuntimeError: one of the variables needed for gradient computation has been modified
by an inplace operation: [torch.cuda.FloatTensor [32, 16]], which is output 0 of
TanhBackward0, is at version 2; expected version 0 instead.
The cache-time logic uses TypeReflection::Kind instead of the type name ("RWTensor" or "WTensor"). Hopefully this aligns with what you were thinking?
As for normal tensors, IIUC, the normal SlangPy Tensor doesn't need copy-back because the shader writes directly to the tensor's GPU buffer (via device_address()). The interop buffer / copy-back mechanism is specifically for PyTorch tensors on non-CUDA backends, where SlangPy can't write directly to PyTorch's CUDA memory from a Vulkan/D3D12 shader.
Or as Claude puts it, the asymmetry is:
- SlangPy buffer on Vulkan/D3D12: The buffer is native to Vulkan/D3D12. The shader writes directly to it. cuda_memory() provides a CUDA view of this same memory for interop (if needed).
- PyTorch tensor on Vulkan/D3D12: The tensor is native to CUDA (PyTorch created it). Vulkan/D3D12 shaders can't write to CUDA-owned memory. So we create a SlangPy interop buffer (which Vulkan/D3D12 can access), run the shader, then copy the results back to PyTorch via CUDA.
Tests for copyback logic in PyTorch integration.
…eters

This fixes an issue with commit 6edd1b6, which used AccessType to prevent spurious copy-back. That approach was incomplete:
- AccessType describes the *parameter binding* (in/out/inout modifiers)
- It does NOT describe whether the underlying *tensor data* is writable

The original fix worked for scalar broadcast (tensor → float) but broke tensor parameters (tensor → RWTensor<T,N>) because:
- RWTensor parameter: binding is "in RWTensor<...>" → AccessType::read
- But the tensor DATA is writable and must be copied back

The new approach checks the *target type* instead:
- Simple types (scalar/vector/matrix): broadcast case, tensor is read-only → skip copy-back (prevents autograd _version increment)
- Complex types (struct/resource): may include tensor types → use m_writable to determine copy-back (preserves output data)

This correctly handles both cases:
1. Tensor → float: no copy-back, autograd works
2. Tensor → RWTensor: copy-back occurs, output data preserved

Adds tests for both cases to prevent regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous fix (b590ae3) incorrectly skipped copy-back for ALL tensors bound to simple types (scalar/vector/matrix), including outputs.

For scalar functions like `float slang_relu(float x)`:
- INPUT tensor → float: read-only broadcast, skip copy-back (correct)
- OUTPUT tensor → float: contains results, MUST copy back (was broken)

The fix now checks BOTH conditions to skip copy-back:
1. Target type is simple (scalar/vector/matrix)
2. Access type is read-only (not write or readwrite)

This ensures outputs (which have write access) are always copied back, while inputs (read access) skip the copy-back that would break autograd.

Co-authored-by: Cursor <cursoragent@cursor.com>
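In other words, the combined condition looks roughly like the following sketch (illustrative Python, not the shipped C++):

```python
def skip_primal_copyback(target_kind: str, access: str) -> bool:
    """Skip copy-back only for read-only tensors broadcast to simple types."""
    is_simple = target_kind in ("scalar", "vector", "matrix")
    is_read_only = access == "read"
    return is_simple and is_read_only
```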
- Remove numbered test case labels (Case 1, 2a, 2b)
- Remove references to specific reviewers in comments
- Improve section header clarity

Co-authored-by: Cursor <cursoragent@cursor.com>
ccummingsNV
left a comment
Hi. This is great work - sorry to be picky on the binding - it's just important we absolutely nail that particular logic, both in terms of correctness and performance.
I've detailed what I think is the only robust path for us now - ultimately looking at the type that is being bound to, which we conveniently do have access to when calculating offsets. I think the actual Slang-side type that we're binding the tensor to (note: not the same as the argument type of the user's function) is the concrete answer to when/whether copies are needed.
Add tests to verify gradient interop buffer copy-back works correctly:
- test_gradient_copyback_single_op: single input gradient verification
- test_gradient_copyback_multiple_inputs: multiple inputs with gradient values

These tests will serve as regression tests for the upcoming refactoring that moves copy-back decisions from dispatch time to cache time.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the torch tensor copy-back decision from runtime dispatch to the offset caching phase. This avoids expensive runtime type reflection (vector_type()->type_reflection()->kind()) on every dispatch.

Changes:
- Add needs_primal_copyback and needs_grad_copyback fields to CachedOffsets
- Compute these flags in ensure_offsets_cached() using binding type/access
- Simplify write_shader_cursor_with_interop() to use pre-computed flags
- Apply the same optimization to gradient copy-back (now checks AccessType)

The copy-back logic remains the same:
- Read-only simple types (scalar/vector/matrix): no copy-back
- Writable outputs or tensor types: copy-back as needed
- Gradients: copy-back only when access is write/readwrite

Co-authored-by: Cursor <cursoragent@cursor.com>
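A rough Python rendering of the cached state described in this and the following commit; the real struct is C++, the offsets field below is a placeholder, and the grad flag is dropped again in a later commit:

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class CachedBindingInfoSketch:
    shader_offsets: Dict[str, int] = field(default_factory=dict)  # placeholder for the cached offsets
    needs_primal_copyback: bool = False
    needs_grad_copyback: bool = False   # later removed in favor of a runtime check
```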
The struct now contains more than just shader offsets - it also holds copy-back decision flags. Rename to better reflect its purpose:
- CachedOffsets → CachedBindingInfo
- ensure_offsets_cached → ensure_binding_info_cached
- m_cached_offsets → m_cached_binding_info
- extract_offsets → extract_binding_info

Co-authored-by: Cursor <cursoragent@cursor.com>
Clarify that copy-back decisions are made at cache time in C++ (ensure_binding_info_cached) based on Slang parameter type and access mode, not just the writable flag. Co-authored-by: Cursor <cursoragent@cursor.com>
Note that these flags are only used by NativeTorchTensorMarshall; NativeTensorMarshall leaves them as default (false). Co-authored-by: Cursor <cursoragent@cursor.com>
Update m_cached_offsets to m_cached_binding_info in new code from main branch. Co-authored-by: Cursor <cursoragent@cursor.com>
The refactoring in 2cd616c moved copy-back decisions to cache time, but this broke gradient copy-back for raw torch.Tensor inputs.

Root cause: needs_grad_copyback was computed using has_derivative(), which returns false for raw torch.Tensor inputs (they don't have d_in/d_out marshalls at construction). However, during the backward pass, gradients ARE present and need to be copied back.

Fix: use the runtime has_grad check instead of the cached needs_grad_copyback for gradient copy-back decisions. This matches the original behavior, where gradient copy-back was determined by whether grad_value was actually present at runtime.

This fixes test_tensor_interfaces and test_tensor_generic failing on Vulkan/D3D12 with all-zero gradient outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
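An illustrative sketch of the runtime decision described above (names approximate the native parameters, not the exact C++ signature):

```python
def should_copy_grad_back(has_grad: bool, grad_numel: int, grad_interop_buffer) -> bool:
    """Gradient copy-back keys off what is actually present at dispatch time."""
    return has_grad and grad_numel > 0 and grad_interop_buffer is not None
```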
This field was never used - gradient copy-back decisions are made at runtime using the has_grad parameter, not cached at bind time. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Handle the backward pass when the primal tensor is None (output slots in autograd backward). Create a zeroed interop buffer instead of crashing, and skip primal copy-back when there is no source tensor.
- Fix broadcast stride zeroing order: apply after make_contiguous_strides so broadcast dimensions correctly use stride 0 for interop buffers. Also apply to the direct CUDA pointer path.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ternal memory

cuExternalMemoryGetMappedBuffer returns a CUdeviceptr that must be freed with cuMemFree before calling cuDestroyExternalMemory. Without this, the CUDA driver keeps the underlying allocation alive, leaking ~64KB+ per interop buffer and eventually hitting VK_ERROR_OUT_OF_DEVICE_MEMORY.

Co-authored-by: Cursor <cursoragent@cursor.com>
Tests exercise realistic ML/optimization workflows using PyTorch autograd, mirroring patterns from the slang-torch examples:
- Polynomial optimization (cubic polynomial fitting with Adam)
- Bezier curve fitting (control point optimization)
- Two-layer MLP optimization (chained linear_transform -> relu4 -> dot4)
- Multi-output optimization (2D vector output)
- Gradient correctness with broadcast parameters
- Multiple backward passes (no state leak)
- Interleaved slangpy + pure PyTorch optimization

Each test validates convergence on both CUDA and Vulkan backends.

Co-authored-by: Cursor <cursoragent@cursor.com>
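The rough shape these workflow tests follow, shown with a pure-PyTorch stand-in for the slangpy kernel; this sketch is illustrative only (the real tests call slangpy functions such as cubic_poly) and assumes a CUDA device is available:

```python
import torch

def polynomial_fit_pattern(device: str = "cuda") -> None:
    coeffs = torch.zeros(4, requires_grad=True, device=device)
    target = torch.tensor([1.0, -2.0, 0.5, 3.0], device=device)
    x = torch.linspace(-1.0, 1.0, 64, device=device)
    optimizer = torch.optim.Adam([coeffs], lr=0.1)

    def poly(c: torch.Tensor) -> torch.Tensor:
        # Stand-in for the slangpy kernel call, e.g. module.cubic_poly(...)
        return c[0] + c[1] * x + c[2] * x**2 + c[3] * x**3

    losses = []
    for _epoch in range(300):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(poly(coeffs), poly(target))
        loss.backward()
        optimizer.step()
        losses.append(loss.item())

    assert losses[-1] < 0.1 * losses[0]   # loose convergence check
```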
…ang-torch ports

The original slang-torch examples use DiffTensorView, [CUDAKernel], [AutoPyBindCUDA], and manual torch.autograd.Function — none of which are used here. Reframe the documentation to be explicit that these tests validate slangpy's PyTorch autograd integration for the same categories of workload, not ports of the original code. Reference shader-slang#740 and shader-slang#768 as prerequisites for true parity tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
Better reflects the file's contents: PyTorch autograd integration tests exercising workflow patterns (optimizer loops, gradient accumulation, chained calls), not end-to-end parity tests. Co-authored-by: Cursor <cursoragent@cursor.com>
ccummingsNV
left a comment
Nice catch on the memory leak.
I think the binding is still fragile, and needs to piggyback off what the system already decided rather than attempt to re-calculate writability. The shader cursor already has enough information to say exactly whether the tensor is writable.
We also need to make that memset async if it is to be used in a performant way.
    // counter via copy_(), breaking autograd's version tracking.
    // Writable outputs MUST copy back to return results.
    //
    // For tensor types (Tensor, RWTensor, etc.), copy back if marshall is writable.
Good that it's being cached, but I honestly think the solution here, as I said below, is to check the type of the tensor being bound to, not attempt to infer writability from the vector type. To clarify:
- vector type == type being passed to the user's function
- slang type == the uniform type stored in the generated kernel's call data, which this marshall represents (this is what the cursor is pointing at)
Python side, a lot of logic goes into working out whether the slang type should be Tensor, WTensor, RWTensor, etc. This is the 'ground truth' with regards to whether copying is needed. A read-only tensor uniform does not need copy-back. A write-only tensor uniform technically does not need the initial copy.
The same principle applies to differentiable tensors, though is more complex:
- DiffTensor == read only primal, writable gradient (grad out)
- WDiffTensor == writable primal, readable gradient (grad in)
- RWDiffTensor == rw primal, rw gradient (grad out + grad in)
So by trying to work things out from the vector type here, you're effectively trying to replicate logic that already happened python side. Even if you get it right for now, it'll mean 2 places this logic needs to be maintained.
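A compact restatement of that mapping as data (illustrative Python only; the type names follow the comment above, and the actual decision is made Python-side and in the marshall):

```python
# (copy primal back, copy grad back) per Slang uniform tensor type name
COPYBACK_BY_SLANG_TYPE = {
    "Tensor":       (False, False),  # read-only primal, no gradient
    "WTensor":      (True,  False),  # writable primal
    "RWTensor":     (True,  False),  # read-write primal
    "DiffTensor":   (False, True),   # read-only primal, writable gradient (grad out)
    "WDiffTensor":  (True,  False),  # writable primal, readable gradient (grad in)
    "RWDiffTensor": (True,  True),   # rw primal, rw gradient (grad out + grad in)
}
```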
    });
    void* cuda_ptr = interop_buffer->cuda_memory();
    if (cuda_ptr && buffer_size > 0)
        cuda::memset_device(static_cast<uint8_t*>(cuda_ptr), 0, buffer_size);
This is a non-async call, so it would block the full gfx pipeline on execution. It should use an async memset (e.g. cuMemsetD8Async) and provide the stream this operation is being run on.
    // This handles non-contiguous tensors via PyTorch's copy mechanism
    // copy_to_buffer() now throws on error with detailed message
    if (info.numel > 0 && info.data_ptr != nullptr) {
        TorchBridge::instance().copy_to_buffer(tensor_value, interop_buffer->cuda_memory(), buffer_size);
I know you didn't write this, but just noting - for reasons related to the cuda::memset_device point, this could cause a bug. If the copy_to_buffer I wrote isn't running on the correct CUDA stream, we could end up with the copy occurring after the actual dispatch.
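For reference, a small sketch of how the raw CUDA stream PyTorch is queuing work on can be read from the Python side (standard PyTorch API; how the handle is threaded into the native layer is only indicated in a comment):

```python
import torch

if torch.cuda.is_available():
    stream = torch.cuda.current_stream()
    raw_stream = stream.cuda_stream   # integer handle (cudaStream_t / CUstream)
    # This is the handle the native layer would need so memsets and copies
    # are ordered with the dispatch on the same stream, instead of stream 0.
    print(f"current CUDA stream handle: {raw_stream:#x}")
```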
Resolve conflict in slangpytorchtensor.cpp: integrate TensorView and DiffTensorView support (shader-slang#775) with CachedBindingInfo naming. Fix stale m_cached_offsets reference in slangpytensor.cpp from the auto-merge of the TensorView code. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the vector_type kind + AccessType approach for determining copy-back with the Slang uniform type name from the shader cursor. The Python layer already determines the concrete Slang tensor type (Tensor/WTensor/RWTensor/DiffTensor/WDiffTensor/RWDiffTensor) — this is the ground truth for writability.

Also cache gradient copy-back at bind time using the same approach:
- DiffTensor → read primal, write grad → copy back grad
- WDiffTensor → write primal, read grad → no grad copy-back
- RWDiffTensor → rw primal, rw grad → copy back grad

This avoids maintaining writability logic in two places and makes the copy-back decision robust against future tensor type changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace synchronous cuda::memset_device (cuMemsetD8) with async cuda::memset_device_async (cuMemsetD8Async) when zeroing interop buffers for backward pass output slots. The synchronous version blocks the host until completion, stalling the GPU pipeline. The async version uses the default CUDA stream (stream 0), which is ordered with respect to all other operations and does not block the host. Co-authored-by: Cursor <cursoragent@cursor.com>
📝 Walkthrough

Adds extensive PyTorch gradient-parity and autograd workflow tests for SlangPy, refactors tensor marshalling to cache binding info (including copy-back flags) instead of offsets, exposes a CUDA-stream-aware async memset, and threads the CUDA stream into SlangPy's CallContext.
Sequence Diagram(s)

    sequenceDiagram
        participant Py as PyTorch (Python)
        participant Bind as SlangPy Binding (C++)
        participant Marshall as Tensor Marshall
        participant GPU as GPU / CUDA
        participant Torch as Torch Tensor Memory
        Py->>Bind: call slang kernel (includes CUDA stream)
        Bind->>Marshall: ensure_binding_info_cached(cursor, binding)
        Marshall-->>Bind: return CachedBindingInfo (offsets, writable flags)
        Bind->>GPU: dispatch kernel (using provided stream)
        GPU-->>Bind: kernel finishes (async stream)
        Bind->>Marshall: post-dispatch copy-back decision (needs_primal_copyback / needs_grad_copyback)
        Marshall->>Torch: perform copy-back to torch tensor memory if needed
        Bind-->>Py: return result tensor (views/contiguity preserved)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (2)
slangpy/tests/slangpy_tests/test_torch_autograd_workflows.py (1)
170-184: Consider renaming the unused loop variable epoch to _epoch.

The static analysis correctly identifies that epoch is not used within the loop body. While this doesn't affect functionality, renaming to _epoch would silence the linter and signal intent. This applies to lines 170, 254, 334, 403, and 577.

Example fix (applies to all similar loops):

    - for epoch in range(300):
    + for _epoch in range(300):

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@slangpy/tests/slangpy_tests/test_torch_autograd_workflows.py` around lines 170-184: The loop variable "epoch" is unused in multiple training loops (e.g., the loop that calls optimizer.zero_grad(), module.cubic_poly(...), loss.backward(), optimizer.step(), and assert_loss_decreased) — rename "epoch" to "_epoch" in each of these for-loops (occurrences you noted) to satisfy the linter and communicate intent; update all for epoch in range(...) declarations to for _epoch in range(...) in the test_torch_autograd_workflows.py file where the training loops are defined.

slangpy/tests/slangpy_tests/test_pytorch_gradient_parity.py (1)
444-444: Refactor lambda to function definition.

The linter correctly flags this lambda assignment. Using def is more readable and allows for docstrings if needed.

Proposed fix:

    - pytorch_mse = lambda output, tgt: nn.functional.mse_loss(output, tgt)
    + def pytorch_mse(output: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    +     return nn.functional.mse_loss(output, tgt)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@slangpy/tests/slangpy_tests/test_pytorch_gradient_parity.py` at line 444: The assignment using a lambda named pytorch_mse should be replaced by a normal function definition: find the lambda assignment "pytorch_mse = lambda output, tgt: nn.functional.mse_loss(output, tgt)" and refactor it into a def pytorch_mse(output, tgt): that returns nn.functional.mse_loss(output, tgt) (optionally add a one-line docstring); update any references to pytorch_mse unchanged since the name stays the same.
memset_device_async was using stream 0 (default), which can race with work on the PyTorch CUDA stream. Thread the stream from NativeCallRuntimeOptions through CallContext so interop buffer operations use the same stream as the dispatch. Co-authored-by: Cursor <cursoragent@cursor.com>
Fixes #733, #387
Bugfixes
- Free the mapped `CUdeviceptr` with `cuMemFree` before destroying external memory, preventing OOM after many Vulkan dispatches.
- Handle the backward pass when the primal tensor is `None` (backward output slot) by creating a zeroed interop buffer, instead of crashing.
- Base copy-back decisions on the Slang uniform tensor type name (`W`/`RW` prefix), cached at bind time. Prevents spurious `copy_()` on read-only inputs that break autograd's version tracking.
- Thread the CUDA stream from `NativeCallRuntimeOptions` through `CallContext` so `memset_device_async` and dispatch use the correct stream instead of stream 0.
- Replace `cuMemsetD8` with `cuMemsetD8Async` for zeroed interop buffers.

Refactoring
- `CachedOffsets` → `CachedBindingInfo`; now holds copy-back decision flags alongside shader offsets.
- `CallContext` constructor accepts an optional `NativeHandle cuda_stream`.

New tests