
Add PyTorch autograd tests and fix interop/copy-back bugs #781

Open

jhelferty-nv wants to merge 34 commits into shader-slang:main from jhelferty-nv:add-pytorch-tests

Conversation


@jhelferty-nv jhelferty-nv commented Feb 5, 2026

Fixes #733, #387

Bugfixes

  • CUDA interop memory leak: free the mapped CUdeviceptr with cuMemFree before destroying the external memory, preventing OOM after many Vulkan dispatches (see the sketch after this list).
  • Interop backward crash: create zeroed interop buffer when primal tensor is None (backward output slot), instead of crashing.
  • Broadcast stride zeroing: apply after contiguous stride computation for interop buffers so broadcast dimensions correctly use stride 0.
  • Copy-back correctness: decide whether to copy interop buffers back based on the Slang uniform type name (W/RW prefix), cached at bind time. Prevents spurious copy_() on read-only inputs that break autograd's version tracking.
  • CUDA stream synchronization: thread the CUDA stream from NativeCallRuntimeOptions through CallContext so memset_device_async and dispatch use the correct stream instead of stream 0.
  • Async memset: replace synchronous cuMemsetD8 with cuMemsetD8Async for zeroed interop buffers.
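
To make the leak fix concrete, here is a minimal sketch of the cleanup order, assuming a hypothetical RAII wrapper that owns both the imported CUexternalMemory handle and the CUdeviceptr returned by cuExternalMemoryGetMappedBuffer (the struct name is illustrative, not the actual SlangPy class):

// Sketch of the destructor ordering the fix relies on: free the mapped device
// pointer *before* destroying the external memory object; otherwise the CUDA
// driver keeps the underlying allocation alive and the interop buffers leak.
#include <cuda.h>

struct MappedExternalMemory {
    CUexternalMemory ext_mem{nullptr}; // imported from a Vulkan/D3D12 shared handle
    CUdeviceptr mapped_ptr{0};         // obtained via cuExternalMemoryGetMappedBuffer

    ~MappedExternalMemory()
    {
        if (mapped_ptr)
            cuMemFree(mapped_ptr);            // release the mapped buffer first
        if (ext_mem)
            cuDestroyExternalMemory(ext_mem); // then tear down the external memory
    }
};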

Refactoring

  • Rename CachedOffsets → CachedBindingInfo; it now holds copy-back decision flags alongside shader offsets.
  • CallContext constructor accepts an optional NativeHandle cuda_stream.

New tests

  • test_pytorch_gradient_parity.py: gradient correctness across activations, losses, slicing, transposes, strided ops, and copy-back semantics.
  • test_torch_autograd_workflows.py: optimizer loops, chained kernels, gradient accumulation, mixed SlangPy + PyTorch graphs.

jhelferty-nv and others added 8 commits February 5, 2026 16:15
…ion increments

Addresses a test failure found in shader-slang#733

SlangPy was copying data back to ALL PyTorch tensors after kernel execution,
including read-only inputs. This copy used PyTorch's copy_() method which
increments the tensor's _version counter, breaking autograd's version tracking
and causing RuntimeError during backward pass for interleaved SlangPy/PyTorch
operations.

The fix checks the binding's access type before copying back - only tensors
with write or readwrite access get data copied back, not read-only inputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jhelferty-nv jhelferty-nv self-assigned this Feb 5, 2026
@jhelferty-nv jhelferty-nv requested a review from a team as a code owner February 5, 2026 21:20

@ccummingsNV ccummingsNV left a comment


Good set of tests, though obviously they need to pass before we can merge.

We should add tests that pass full tensors in as well as tests that just take scalar variables, as the interop logic is subtly different.

It would also be nice to see if we can get rid of the need to wrap slangpy functions in torch modules for them to work. Our goal is for a user to integrate seamlessly - creating custom wrapper classes isn't ideal.

AccessType primal_access = binding->access().first;
bool primal_is_writable = (primal_access == AccessType::write || primal_access == AccessType::readwrite);
bool needs_primal_copyback = m_writable && primal_is_writable && primal_info.numel > 0 && primal_interop_buffer;
bool needs_grad_copyback = has_grad && grad_info.numel > 0 && grad_interop_buffer;
Contributor


I'm not convinced this logic is correct. It works for a scalar during broadcasting (i.e. passing a tensor to a float), but what if it's passing a tensor to a tensor? The access type would probably be 'read' (it's just reading the tensor info) even though the tensor is written to.

Contributor Author


Confirmed, I'll push a fix and tests for this.

Contributor


I'm not happy with this - it's getting too complex, and will slow down the dispatch. The simplest is just to check if it is writable, though as you point out that could result in redundant copying.

Probably the correct logic, if we want to do it, would be to adjust the logic that caches offsets to look at the slang uniform type that is being bound to, and identify whether it is a writable tensor type (by name). This could work for gradient based tensors too. i.e. in that situation we're saying:

  • if the user has explicitly written a function that takes a writable tensor, assume it is being written to
  • otherwise, if the user has written a scalar function we're vectorizing across, and it has an 'out/inout' param, assume it is being written to

Exactly the same principle could be applied to the gradients, making that logic more robust/optimal as well.

As that code only happens once, it would not slow down dispatches. (though we'd want to rename some bits so it didn't claim to just be caching offsets when it in fact was caching more than that!).

Ultimately the actual decision making for this is in tensorcommon.py, where we generate the call data for the tensor. This is where we decide what actual tensor type the generated kernel uses. Another option would be to write some boolean value 'needs copying' inside gen_calldata, but that feels wrong to me - gen_calldata by implication is immutable, and many design decisions are based on the idea that a marshall is complete from the moment it's created.

We probably also need to apply this logic to the normal tensor right?

Contributor Author


Okay, I've gone ahead and moved the logic to cache-time. We do need this fixed, because more complex cases where a Sequential jumps back and forth between SlangPy and PyTorch will generate an assert if we end up copying a read-only input tensor. e.g.,

RuntimeError: one of the variables needed for gradient computation has been modified 
by an inplace operation: [torch.cuda.FloatTensor [32, 16]], which is output 0 of 
TanhBackward0, is at version 2; expected version 0 instead.

The cache-time logic uses TypeReflection::Kind instead of the type name ("RWTensor" or "WTensor") - hopefully this aligns with what you were thinking?

As for normal tensors, IIUC, the normal SlangPy Tensor doesn't need copy-back because the shader writes directly to the tensor's GPU buffer (via device_address()). The interop buffer / copy-back mechanism is specifically for PyTorch tensors on non-CUDA backends, where SlangPy can't write directly to PyTorch's CUDA memory from a Vulkan/D3D12 shader.

Or as Claude puts it, the asymmetry is:

  • SlangPy buffer on Vulkan/D3D12: The buffer is native to Vulkan/D3D12. The shader writes directly to it. cuda_memory() provides a CUDA view of this same memory for interop (if needed).
  • PyTorch tensor on Vulkan/D3D12: The tensor is native to CUDA (PyTorch created it). Vulkan/D3D12 shaders can't write to CUDA-owned memory. So we create a SlangPy interop buffer (which Vulkan/D3D12 can access), run the shader, then copy the results back to PyTorch via CUDA.

jhelferty-nv and others added 5 commits February 9, 2026 11:17
Tests for copyback logic in PyTorch integration.
…eters

This fixes an issue with commit 6edd1b6 which used AccessType to prevent
spurious copy-back. That approach was incomplete:

- AccessType describes the *parameter binding* (in/out/inout modifiers)
- It does NOT describe whether the underlying *tensor data* is writable

The original fix worked for scalar broadcast (tensor → float) but broke
tensor parameters (tensor → RWTensor<T,N>) because:
- RWTensor parameter: binding is "in RWTensor<...>" → AccessType::read
- But the tensor DATA is writable and must be copied back

The new approach checks the *target type* instead:
- Simple types (scalar/vector/matrix): Broadcast case, tensor is read-only
  → Skip copy-back (prevents autograd _version increment)
- Complex types (struct/resource): May include tensor types
  → Use m_writable to determine copy-back (preserves output data)

This correctly handles both cases:
1. Tensor → float: No copy-back, autograd works
2. Tensor → RWTensor: Copy-back occurs, output data preserved

Adds tests for both cases to prevent regression.

Co-authored-by: Cursor <cursoragent@cursor.com>
The previous fix (b590ae3) incorrectly skipped copy-back for ALL tensors
bound to simple types (scalar/vector/matrix), including outputs.

For scalar functions like `float slang_relu(float x)`:
- INPUT tensor → float: Read-only broadcast, skip copy-back (correct)
- OUTPUT tensor → float: Contains results, MUST copy back (was broken)

The fix now checks BOTH conditions to skip copy-back:
1. Target type is simple (scalar/vector/matrix)
2. Access type is read-only (not write or readwrite)

This ensures outputs (which have write access) are always copied back,
while inputs (read access) skip the copy-back that would break autograd.

Co-authored-by: Cursor <cursoragent@cursor.com>
- Remove numbered test case labels (Case 1, 2a, 2b)
- Remove references to specific reviewers in comments
- Improve section header clarity

Co-authored-by: Cursor <cursoragent@cursor.com>

@ccummingsNV ccummingsNV left a comment


Hi. This is great work - sorry to be picky on the binding - it's just important we absolutely nail that particular logic, both in terms of correctness and performance.

I've detailed what I think is the only robust path for us now - ultimately looking at the type that is being bound to, which we conveniently do have access to when calculating offsets. I think the actual slang-side type that we're binding the tensor to (note: not the same as the argument type of the user's function) is the concrete answer to when/whether copies are needed.

jhelferty-nv and others added 5 commits February 10, 2026 15:28
Add tests to verify gradient interop buffer copy-back works correctly:
- test_gradient_copyback_single_op: Single input gradient verification
- test_gradient_copyback_multiple_inputs: Multiple inputs with gradient values

These tests will serve as regression tests for the upcoming refactoring
that moves copy-back decisions from dispatch time to cache time.

Co-authored-by: Cursor <cursoragent@cursor.com>
Move the torch tensor copy-back decision from runtime dispatch to the
offset caching phase. This avoids expensive runtime type reflection
(vector_type()->type_reflection()->kind()) on every dispatch.

Changes:
- Add needs_primal_copyback and needs_grad_copyback fields to CachedOffsets
- Compute these flags in ensure_offsets_cached() using binding type/access
- Simplify write_shader_cursor_with_interop() to use pre-computed flags
- Apply same optimization to gradient copy-back (now checks AccessType)

The copy-back logic remains the same:
- Read-only simple types (scalar/vector/matrix): no copy-back
- Writable outputs or tensor types: copy-back as needed
- Gradients: copy-back only when access is write/readwrite

Co-authored-by: Cursor <cursoragent@cursor.com>
The struct now contains more than just shader offsets - it also holds
copy-back decision flags. Rename to better reflect its purpose:

- CachedOffsets → CachedBindingInfo
- ensure_offsets_cached → ensure_binding_info_cached
- m_cached_offsets → m_cached_binding_info
- extract_offsets → extract_binding_info

Co-authored-by: Cursor <cursoragent@cursor.com>
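
As a rough illustration of what the renamed cache might hold, here is a sketch using illustrative member names taken from the commit messages above; the actual layout in the SlangPy marshall code may differ:

// Sketch of the cached per-binding info: shader offsets computed once at bind
// time, plus the copy-back decisions made alongside them.
#include <cstddef>

struct CachedBindingInfoSketch {
    size_t primal_offset = 0;           // cursor offset of the primal tensor uniform
    size_t grad_offset = 0;             // cursor offset of the gradient tensor uniform
    bool needs_primal_copyback = false; // copy primal interop buffer back after dispatch
    bool needs_grad_copyback = false;   // copy gradient interop buffer back after dispatch
};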
Clarify that copy-back decisions are made at cache time in C++
(ensure_binding_info_cached) based on Slang parameter type and
access mode, not just the writable flag.

Co-authored-by: Cursor <cursoragent@cursor.com>
Note that these flags are only used by NativeTorchTensorMarshall;
NativeTensorMarshall leaves them as default (false).

Co-authored-by: Cursor <cursoragent@cursor.com>
jhelferty-nv and others added 6 commits February 11, 2026 12:54
Update m_cached_offsets to m_cached_binding_info in new code
from main branch.

Co-authored-by: Cursor <cursoragent@cursor.com>
The refactoring in 2cd616c moved copy-back decisions to cache time,
but this broke gradient copy-back for raw torch.Tensor inputs.

Root cause: needs_grad_copyback was computed using has_derivative(),
which returns false for raw torch.Tensor inputs (they don't have
d_in/d_out marshalls at construction). However, during the backward
pass, gradients ARE present and need to be copied back.

Fix: Use runtime has_grad check instead of cached needs_grad_copyback
for gradient copy-back decisions. This matches the original behavior
where gradient copy-back was determined by whether grad_value was
actually present at runtime.

This fixes test_tensor_interfaces and test_tensor_generic failing
on Vulkan/D3D12 with all-zero gradient outputs.

Co-authored-by: Cursor <cursoragent@cursor.com>
This field was never used - gradient copy-back decisions are made
at runtime using the has_grad parameter, not cached at bind time.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Handle backward pass when primal tensor is None (output slots in
  autograd backward). Create a zeroed interop buffer instead of
  crashing, and skip primal copy-back when there is no source tensor.

- Fix broadcast stride zeroing order: apply after make_contiguous_strides
  so broadcast dimensions correctly use stride 0 for interop buffers.
  Also apply to the direct CUDA pointer path (see the sketch below).

Co-authored-by: Cursor <cursoragent@cursor.com>
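
The stride fix above can be illustrated with a small sketch, assuming a hypothetical helper that builds interop-buffer strides from a shape plus a broadcast mask (names are illustrative; the real logic lives in the torch tensor marshall):

// Sketch: compute contiguous (row-major) strides first, then zero the strides
// of broadcast dimensions. Doing it in the opposite order would overwrite the
// zeros when the contiguous strides are recomputed.
#include <cstddef>
#include <vector>

std::vector<size_t> interop_strides(const std::vector<size_t>& shape, const std::vector<bool>& is_broadcast)
{
    std::vector<size_t> strides(shape.size(), 1);
    for (int i = int(shape.size()) - 2; i >= 0; --i)
        strides[size_t(i)] = strides[size_t(i) + 1] * shape[size_t(i) + 1]; // contiguous strides
    for (size_t i = 0; i < shape.size(); ++i)
        if (is_broadcast[i])
            strides[i] = 0; // broadcast dimensions re-read the same element
    return strides;
}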
jhelferty-nv and others added 5 commits February 12, 2026 17:20
…ternal memory

cuExternalMemoryGetMappedBuffer returns a CUdeviceptr that must be freed
with cuMemFree before calling cuDestroyExternalMemory. Without this, the
CUDA driver keeps the underlying allocation alive, leaking ~64KB+ per
interop buffer and eventually hitting VK_ERROR_OUT_OF_DEVICE_MEMORY.

Co-authored-by: Cursor <cursoragent@cursor.com>
Tests exercise realistic ML/optimization workflows using PyTorch autograd,
mirroring patterns from the slang-torch examples:

- Polynomial optimization (cubic polynomial fitting with Adam)
- Bezier curve fitting (control point optimization)
- Two-layer MLP optimization (chained linear_transform -> relu4 -> dot4)
- Multi-output optimization (2D vector output)
- Gradient correctness with broadcast parameters
- Multiple backward passes (no state leak)
- Interleaved slangpy + pure PyTorch optimization

Each test validates convergence on both CUDA and Vulkan backends.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ang-torch ports

The original slang-torch examples use DiffTensorView, [CUDAKernel],
[AutoPyBindCUDA], and manual torch.autograd.Function — none of which
are used here. Reframe documentation to be explicit that these tests
validate slangpy's PyTorch autograd integration for the same categories
of workload, not ports of the original code. Reference shader-slang#740 and shader-slang#768
as prerequisites for true parity tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
Better reflects the file's contents: PyTorch autograd integration tests
exercising workflow patterns (optimizer loops, gradient accumulation,
chained calls), not end-to-end parity tests.

Co-authored-by: Cursor <cursoragent@cursor.com>
@jhelferty-nv jhelferty-nv changed the title from "Add PyTorch gradient parity tests and fix autograd compatibility bug" to "Add PyTorch autograd tests and fix interop/copy-back bugs" Feb 12, 2026

@ccummingsNV ccummingsNV left a comment


Nice catch on the memory leak.

I think the binding is still fragile, and needs to piggyback off what the system already decided rather than attempt to re-calculate writability. The shader cursor already has enough information to say exactly whether the tensor is writable.

We also need to make that memset async if it is to be used in a performant way.

// counter via copy_(), breaking autograd's version tracking.
// Writable outputs MUST copy back to return results.
//
// For tensor types (Tensor, RWTensor, etc.), copy back if marshall is writable.
Contributor


Good that it's being cached, but I honestly think the solution here, as I said below, is to check the type of the tensor being bound to, not attempt to infer writability from the vector type. To clarify:

  • vector type == type being passed to the user's function
  • slang type == the uniform type stored in the generated kernel's call data, which this marshall represents (this is what the cursor is pointing at)

Python-side, a lot of logic goes into working out whether the slang type should be Tensor, WTensor, RWTensor, etc. This is the 'ground truth' with regard to whether copying is needed. A read-only tensor uniform does not need copy-back. A write-only tensor uniform technically does not need the initial copy.

The same principle applies to differentiable tensors, though is more complex:

  • DiffTensor == read only primal, writable gradient (grad out)
  • WDiffTensor == writable primal, readable gradient (grad in)
  • RWDiffTensor == rw primal, rw gradient (grad out + grad in)

So by trying to work things out from the vector type here, you're effectively trying to replicate logic that already happened Python-side. Even if you get it right for now, it'll mean this logic needs to be maintained in two places.

});
void* cuda_ptr = interop_buffer->cuda_memory();
if (cuda_ptr && buffer_size > 0)
cuda::memset_device(static_cast<uint8_t*>(cuda_ptr), 0, buffer_size);
Contributor


This is a non-async call, so it would block the full gfx pipeline on execution. It should use an async cuMemset (e.g. cuMemsetD8Async) and provide the stream this operation is being run on.

// This handles non-contiguous tensors via PyTorch's copy mechanism
// copy_to_buffer() now throws on error with detailed message
if (info.numel > 0 && info.data_ptr != nullptr) {
TorchBridge::instance().copy_to_buffer(tensor_value, interop_buffer->cuda_memory(), buffer_size);
Contributor


I know you didn't write this, but just noting - for reasons related to the cuda::memset_device point, this could cause a bug. If the copy_to_buffer I wrote isn't running on the correct CUDA stream, we could end up with the copy occurring after the actual dispatch.

jhelferty-nv and others added 3 commits February 17, 2026 16:59
Resolve conflict in slangpytorchtensor.cpp: integrate TensorView and
DiffTensorView support (shader-slang#775) with CachedBindingInfo naming.

Fix stale m_cached_offsets reference in slangpytensor.cpp from the
auto-merge of the TensorView code.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace the vector_type kind + AccessType approach for determining
copy-back with the Slang uniform type name from the shader cursor.
The Python layer already determines the concrete Slang tensor type
(Tensor/WTensor/RWTensor/DiffTensor/WDiffTensor/RWDiffTensor) — this
is the ground truth for writability.

Also cache gradient copy-back at bind time using the same approach:
  DiffTensor   → read primal, write grad → copy back grad
  WDiffTensor  → write primal, read grad → no grad copy-back
  RWDiffTensor → rw primal, rw grad      → copy back grad

This avoids maintaining writability logic in two places and makes the
copy-back decision robust against future tensor type changes.

Co-authored-by: Cursor <cursoragent@cursor.com>
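
The decision table described in this commit could look roughly like the sketch below, assuming the Slang uniform type name is available at bind time (the function and struct names here are illustrative, not the actual CachedBindingInfo code):

// Sketch: derive copy-back flags from the Slang uniform type name that the
// Python layer chose for this tensor (e.g. "Tensor", "WTensor", "RWTensor",
// "DiffTensor", "WDiffTensor", "RWDiffTensor"). Computed once at bind time.
#include <string>

struct CopyBackFlags {
    bool primal = false; // copy the primal interop buffer back to the torch tensor
    bool grad = false;   // copy the gradient interop buffer back
};

inline bool starts_with(const std::string& s, const std::string& prefix)
{
    return s.rfind(prefix, 0) == 0;
}

CopyBackFlags copy_back_from_type_name(const std::string& type_name)
{
    CopyBackFlags flags;
    // Writable primal: WTensor/RWTensor and their Diff variants.
    flags.primal = starts_with(type_name, "RW") || starts_with(type_name, "W");
    // Gradient copy-back: DiffTensor writes grad out and RWDiffTensor reads and
    // writes it; WDiffTensor only reads grad in, so nothing to copy back.
    flags.grad = starts_with(type_name, "DiffTensor") || starts_with(type_name, "RWDiffTensor");
    return flags;
}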
Replace synchronous cuda::memset_device (cuMemsetD8) with async
cuda::memset_device_async (cuMemsetD8Async) when zeroing interop
buffers for backward pass output slots. The synchronous version
blocks the host until completion, stalling the GPU pipeline.

The async version uses the default CUDA stream (stream 0), which
is ordered with respect to all other operations and does not block
the host.

Co-authored-by: Cursor <cursoragent@cursor.com>
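
For reference, a minimal sketch of an async, stream-aware clear in the spirit of cuda::memset_device_async; the wrapper name and omitted error handling are illustrative, while cuMemsetD8Async itself is the real CUDA driver call:

// Sketch: zero an interop buffer asynchronously on a caller-provided stream,
// so the clear is ordered with the dispatch instead of blocking the host.
#include <cuda.h>
#include <cstddef>

inline void memset_device_async_sketch(void* dst, unsigned char value, size_t count, CUstream stream)
{
    if (dst == nullptr || count == 0)
        return;
    // cuMemsetD8Async enqueues the fill on 'stream' and returns immediately.
    cuMemsetD8Async(reinterpret_cast<CUdeviceptr>(dst), value, count, stream);
}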

coderabbitai bot commented Feb 18, 2026

📝 Walkthrough

Adds extensive PyTorch gradient-parity and autograd workflow tests for SlangPy, refactors tensor marshalling to cache binding info (including copy-back flags) instead of offsets, exposes a CUDA-stream-aware async memset, and threads CUDA stream into SlangPy CallContext.

Changes

Cohort / File(s) | Summary
PyTorch Gradient Parity Testing
slangpy/tests/slangpy_tests/test_pytorch_gradient_parity.py
New comprehensive test suite comparing gradients between native PyTorch ops and SlangPy-wrapped kernels across activations, losses, slicing/strides, transposes, multi-op sequences, and copy-back scenarios. Adds test utilities and SlangPy kernel snippets.
PyTorch Autograd Workflow Testing
slangpy/tests/slangpy_tests/test_torch_autograd_workflows.py
New end-to-end autograd/optimization tests (polynomial, Bezier fitting, two-layer MLP, multi-output, broadcast-grad checks, interleaved SlangPy+PyTorch flows) exercising optimizer convergence and gradient correctness.
Tensor Marshalling Refactor (binding info + copy-back flags)
src/slangpy_ext/utils/slangpytensor.h, src/slangpy_ext/utils/slangpytensor.cpp, src/slangpy_ext/utils/slangpytorchtensor.h, src/slangpy_ext/utils/slangpytorchtensor.cpp
Renamed CachedOffsets → CachedBindingInfo, added needs_primal_copyback and needs_grad_copyback flags, replaced ensure_offsets_cached → ensure_binding_info_cached, and updated all marshalling/read/write paths to use binding-info fields and runtime-deduced writability.
CallContext CUDA stream plumbing
src/sgl/utils/slangpy.h, src/slangpy_ext/utils/slangpy.cpp
CallContext constructor now accepts a NativeHandle cuda_stream and exposes a cuda_stream() accessor; the SlangPy exec path extracts the CUDA stream and passes it into CallContext, and the Python binding is updated accordingly.
CUDA async memset & ExternalMemory cleanup
src/sgl/device/cuda_utils.h, src/sgl/device/cuda_utils.cpp
Added memset_device_async(dst, value, count, CUstream) for async device clears and freed mapped device memory in ExternalMemory destructor to prevent leaks.
Torch marshalling comment only
slangpy/torchintegration/torchtensormarshall.py
Added clarifying comments about writable flag semantics and that copy-back decisions are made in C++ (no functional change).

Sequence Diagram(s)

sequenceDiagram
  participant Py as PyTorch (Python)
  participant Bind as SlangPy Binding (C++)
  participant Marshall as Tensor Marshall
  participant GPU as GPU / CUDA
  participant Torch as Torch Tensor Memory

  Py->>Bind: call slang kernel (includes CUDA stream)
  Bind->>Marshall: ensure_binding_info_cached(cursor, binding)
  Marshall-->>Bind: return CachedBindingInfo (offsets, writable flags)
  Bind->>GPU: dispatch kernel (using provided stream)
  GPU-->>Bind: kernel finishes (async stream)
  Bind->>Marshall: post-dispatch copy-back decision (needs_primal_copyback / needs_grad_copyback)
  Marshall->>Torch: perform copy-back to torch tensor memory if needed
  Bind-->>Py: return result tensor (views/contiguity preserved)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • ccummingsNV

Poem

"🐰 I hopped through offsets, binding info in tow,
Gradients matched, slice and stride in a row.
CUDA streams hum, async clears on the fly,
Copy-backs now known, no surprises nearby.
Cheers — a rabbit's nibble on code we bestow!"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 49.15%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title accurately summarizes the primary changes: adding PyTorch autograd tests and fixing interop/copy-back bugs, which aligns with the comprehensive changeset across test files and system-level interop fixes.
Linked Issues check | ✅ Passed | The PR addresses issue #733 by adding comprehensive PyTorch tests (test_torch_autograd_workflows.py, test_pytorch_gradient_parity.py) that exercise complex tensor usage, optimizer loops, gradient accumulation, and mixed slangpy+PyTorch operations as explicitly requested.
Out of Scope Changes check | ✅ Passed | All changes are directly related to the PR objectives: new test modules, interop buffer fixes (cuda_utils, slangpytensor refactoring), copy-back decision caching, and stream threading through CallContext, all supporting the testing and bugfix goals.
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.




@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (2)
slangpy/tests/slangpy_tests/test_torch_autograd_workflows.py (1)

170-184: Consider renaming unused loop variable epoch to _epoch.

The static analysis correctly identifies that epoch is not used within the loop body. While this doesn't affect functionality, renaming to _epoch would silence the linter and signal intent. This applies to lines 170, 254, 334, 403, and 577.

Example fix (applies to all similar loops)
-    for epoch in range(300):
+    for _epoch in range(300):
slangpy/tests/slangpy_tests/test_pytorch_gradient_parity.py (1)

444-444: Refactor lambda to function definition.

The linter correctly flags this lambda assignment. Using def is more readable and allows for docstrings if needed.

Proposed fix
-    pytorch_mse = lambda output, tgt: nn.functional.mse_loss(output, tgt)
+    def pytorch_mse(output: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
+        return nn.functional.mse_loss(output, tgt)

jhelferty-nv and others added 2 commits February 19, 2026 13:20
memset_device_async was using stream 0 (default), which can race with
work on the PyTorch CUDA stream. Thread the stream from
NativeCallRuntimeOptions through CallContext so interop buffer
operations use the same stream as the dispatch.

Co-authored-by: Cursor <cursoragent@cursor.com>
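
To illustrate the plumbing, here is a simplified sketch of threading a CUDA stream handle through a call context; NativeHandleSketch stands in for sgl's NativeHandle, and all other members are omitted:

// Sketch: carry the dispatch's CUDA stream on the call context so that
// interop-buffer operations (async memset, copy-back) run on the same
// stream as the kernel instead of stream 0.
#include <cuda.h>

struct NativeHandleSketch {  // stand-in for sgl's NativeHandle
    CUstream stream{nullptr};
};

class CallContextSketch {
public:
    explicit CallContextSketch(NativeHandleSketch cuda_stream = {})
        : m_cuda_stream(cuda_stream)
    {
    }

    // Accessor used by marshalls when they need to enqueue CUDA work.
    const NativeHandleSketch& cuda_stream() const { return m_cuda_stream; }

private:
    NativeHandleSketch m_cuda_stream;
};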

Development

Successfully merging this pull request may close these issues.

Add tests: Complex use of torch tensors
