Commit c4a7ea8
Torch examples (#459)
* Fix torch examples + add copilot info
* Update samples repo
1 parent bc994e4 commit c4a7ea8

File tree

5 files changed (+118, −19 lines)

.github/copilot-instructions.md

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+Act as a senior C++ and Python developer with extensive experience in low-level graphics libraries such as Vulkan, Direct3D 12, and CUDA, and in implementing native Python extensions.
+
+Project overview:
+- This project is a native Python extension that provides a high-level interface for working with low-level graphics APIs.
+- The majority of the native side provides a more Python-friendly wrapper around the slang-rhi project (in #external/slang-rhi).
+- The Python extension exposes this code using nanobind bindings.
+- The project also contains a predominantly Python system that allows the user to 'call' a Slang function on the GPU with Python function call syntax.
+
+Directory structure:
+- The majority of the native C++ code is in #src/sgl.
+- The Python bindings are in the directory #src/slangpy_ext.
+- The Python code is in #slangpy.
+- The Python tests are in #slangpy/tests.
+- The C++ tests are in #tests.
+
+Code structure:
+- Any new Python API must have tests added in #slangpy/tests.
+- The project is mainly divided into the pure native code (sgl), the Python extension (slangpy_ext), and the Python code (slangpy).
+- The C++ code is responsible for the low-level graphics API interactions, and most types directly map to a slang-rhi counterpart, e.g. Device wraps the slang-rhi rhi::IDevice type.
+
+Building:
+- To build the project, run "cmake --build ./build --config Debug".
+
+Testing:
+- Python tests are in #slangpy/tests and C++ tests are in #tests.
+- The Python testing system uses pytest.
+- The C++ testing system uses doctest.
+- Always build before running tests.
+- To run all Python tests, run "pytest slangpy/tests".
+
+C++ code style:
+- Class names should start with a capital letter.
+- Function names are in snake_case.
+- Local variable names are in snake_case.
+- Member variables start with "m_" and are in snake_case.
+
+Python code style:
+- Class names should start with a capital letter.
+- Function names are in snake_case.
+- Local variable names are in snake_case.
+- Member variables start with "m_" and are in snake_case.
+- All arguments should have type annotations.
+
+Additional tools:
+- Once a task is complete, run "pre-commit run --all-files" to fix any formatting errors.
+- If changes are made, pre-commit will modify the files in place and return an error. Re-running the command should then succeed.
+
+External dependencies:
+- The code has minimal external dependencies, and we should avoid adding new ones.
+- pytest is used for Python testing.
+- doctest is used for C++ testing.
+- Existing Python libraries for the runtime are in #requirements.txt.
+- Python libraries for development (e.g. tests) are in #requirements-dev.txt.
+- The Slang shading language is used for writing shaders.
+- Most external C++ dependencies are in #external.

docs/src/autodiff/pytorch.rst

Lines changed: 56 additions & 17 deletions
@@ -1,26 +1,31 @@
 PyTorch
 =======

-Building on the previous `auto-diff <autodiff.html>`_ example, the switch to PyTorch and its auto-grad capabilities is trivial.
+Building on the previous `auto-diff <autodiff.html>`_ example, the switch to PyTorch and its auto-grad capabilities is straightforward.

 Initialization
 --------------

-The critical line that changes is loading the module:
+To use SlangPy with PyTorch, you first need to create a device configured for PyTorch integration:

 .. code-block:: python

-    # Load torch wrapped module.
-    module = spy.TorchModule.load_from_file(device, "example.slang")
+    import slangpy as spy
+    import torch

-Here, rather than simply write ``spy.Module.load_from_file``, we write ``spy.TorchModule.load_from_file``. From here, all structures or functions utilizing the module will support PyTorch tensors and be injected into PyTorch's auto-grad graph.
+    # Create a device configured for PyTorch integration
+    # CUDA backend is recommended for best performance
+    device = spy.create_torch_device(type=spy.DeviceType.cuda)

-In future SlangPy versions we intend to remove the need for wrapping altogether, instead auto-detecting the need for auto-grad support at the point of call.
+    # Load module using the standard Module type
+    module = spy.Module.load_from_file(device, "example.slang")
+
+SlangPy automatically detects when PyTorch tensors are used and integrates them into PyTorch's auto-grad graph. No special module types are needed - you can use the standard ``spy.Module`` type as documented in `First Functions <../basics/firstfunctions.html>`_.

 Creating a tensor
 -----------------

-Now, rather than use a SlangPy ``Tensor``, we create a ``torch.Tensor`` tensor to store the inputs:
+Now, rather than use a SlangPy ``Tensor``, we create a ``torch.Tensor`` to store the inputs:

 .. code-block:: python

@@ -30,16 +35,16 @@ Now, rather than use a SlangPy ``Tensor``, we create a ``torch.Tensor`` tensor t
 Note:

 - We set ``requires_grad=True`` to tell PyTorch to track the gradients of this tensor.
-- We set ``device='cuda'`` to ensure the tensor is on the GPU.
+- We set ``device='cuda'`` to ensure the tensor is on the GPU and matches our device configuration.

 Running the kernel
 ------------------

-Calling the function is pretty much unchanged, however calculation of gradients is now done via PyTorch:
+Calling the function is unchanged from the standard SlangPy API, but calculation of gradients is now done via PyTorch:

 .. code-block:: python

-    # Evaluate the polynomial. Result will now default to a torch tensor.
+    # Evaluate the polynomial. Result will automatically be a torch tensor.
     # Expecting result = 2x^2 + 8x - 1
     result = module.polynomial(a=2, b=8, c=-1, x=x)
     print(result)
@@ -49,20 +54,54 @@ Calling the function is pretty much unchanged, however calculation of gradients
     result.backward(torch.ones_like(result))
     print(x.grad)

-This works because the wrapped PyTorch module automatically wrapped the call to `polynomial` in a custom autograd function. As a result, the call to `result.backwards` automatically called `module.polynomial.bwds`.
+This works because SlangPy automatically detects PyTorch tensors and wraps the call to `polynomial` in a custom autograd function. As a result, the call to `result.backward` automatically invokes `module.polynomial.bwds` to compute gradients.
+
+Device Backend Selection
+------------------------
+
+SlangPy supports multiple backend types for PyTorch integration:
+
+**CUDA Backend (Recommended)**
+
+The CUDA backend provides the best performance by directly sharing the CUDA context with PyTorch:
+
+.. code-block:: python
+
+    device = spy.create_torch_device(type=spy.DeviceType.cuda)
+
+This approach avoids expensive context switching and memory copies, making it ideal for performance-critical applications.
+
+**Graphics Backends (D3D12, Vulkan)**
+
+For applications that need access to graphics features (such as rasterization), you can use D3D12 or Vulkan backends:
+
+.. code-block:: python
+
+    # D3D12 backend (Windows only)
+    device = spy.create_torch_device(type=spy.DeviceType.d3d12)
+
+    # Vulkan backend (cross-platform)
+    device = spy.create_torch_device(type=spy.DeviceType.vulkan)
+
+These backends use CUDA interop with shared memory and semaphores to synchronize between SlangPy and PyTorch. While functional, this approach has higher overhead due to hardware context switching and memory copies.

 A word on performance
 ---------------------

-This example showed a very basic use of PyTorch's auto-grad capabilities. However in practice, the switch from a CUDA PyTorch context to a D3D or Vulkan context has an overhead. Typically, very simple logic will be faster in PyTorch. However as functions become more complex, the benefits of writing them as simple scalar processes that are vectorized by SlangPy and wrapped in PyTorch quickly become apparent.
+The choice of backend significantly impacts performance:

-Additionally, we intend to add a pure CUDA backend to SlangPy in the future, which will allow for seamless switching between PyTorch and SlangPy contexts.
+- **CUDA Backend**: Provides the best performance for compute-focused workloads. Very simple operations may still be faster in pure PyTorch, but as functions become more complex, the benefits of SlangPy's vectorization and GPU optimization become apparent.
+
+- **Graphics Backends (D3D12/Vulkan)**: Useful when graphics features are required, but expect substantially worse performance due to context switching overhead. Consider whether the graphics features are truly necessary for your use case.

 Summary
 -------

-That's it! You can now use PyTorch tensors with SlangPy, and take advantage of PyTorch's auto-grad capabilities. This example covered:
+PyTorch integration with SlangPy is seamless and automatic. This example covered:
+
+- Device creation using `create_torch_device` with support for CUDA, D3D12, and Vulkan backends
+- Automatic detection of PyTorch tensors - no special module types required
+- Use of PyTorch's `.backward()` process to track an auto-grad graph and backpropagate gradients
+- Performance considerations when choosing between CUDA and graphics backends

-- Initialization with a `TorchModule` to enable PyTorch support
-- Use of PyTorch's `.backward` process to track an auto-grad graph and back propagate gradients.
-- Performance considerations when wrapping Slang code with PyTorch.
+The CUDA backend is recommended for best performance, while graphics backends provide access to additional GPU features at the cost of some performance overhead.
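As a sanity check on the numbers the updated docs expect, and independent of any GPU or PyTorch install, the example polynomial and its analytic derivative can be computed directly in plain Python. This is an illustrative sketch, not part of the commit; `backward()` in the docs' example should fill `x.grad` with exactly this derivative:

```python
def polynomial(a: float, b: float, c: float, x: float) -> float:
    # f(x) = a*x^2 + b*x + c, matching the docs' "2x^2 + 8x - 1" example
    return a * x * x + b * x + c


def polynomial_grad(a: float, b: float, x: float) -> float:
    # Analytic derivative df/dx = 2*a*x + b; with a=2, b=8 this is 4x + 8,
    # which is the value result.backward() should place in x.grad.
    return 2 * a * x + b


xs = [0.0, 1.0, 2.0]
print([polynomial(2, 8, -1, x) for x in xs])   # [-1.0, 9.0, 23.0]
print([polynomial_grad(2, 8, x) for x in xs])  # [8.0, 12.0, 16.0]
```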

samples

slangpy/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
 # SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 # pyright: reportUnusedImport=false
 # isort: skip_file
-from .core.utils import create_device
+from .core.utils import create_device, create_torch_device
 import runpy
 import pathlib

slangpy/core/utils.py

Lines changed: 5 additions & 0 deletions
@@ -94,8 +94,13 @@ def create_torch_device(
     # Import and init torch
     import torch

+    # Ensure torch cuda is initialized
     torch.cuda.init()

+    # These lines ensure that torch has set a default context
+    torch.cuda.current_device()
+    torch.cuda.current_stream()
+
     # Use current device if not specified
     if torch_device is None:
         torch_device = torch.cuda.current_device()
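The three `torch.cuda` calls added above matter because, per the commit's own comments, `torch.cuda.init()` alone does not guarantee torch has bound its default CUDA context, which the SlangPy device needs to share. A defensive sketch of the same idea (the helper name is hypothetical, and it degrades gracefully when torch or CUDA is absent):

```python
import importlib.util


def ensure_torch_cuda_context() -> bool:
    """Force creation of torch's default CUDA context, mirroring the
    logic added to create_torch_device. Returns True if a context is
    now bound, False if torch or CUDA is unusable on this machine."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch not installed
    import torch

    if not torch.cuda.is_available():
        return False  # no CUDA runtime/driver
    torch.cuda.init()
    # Querying the current device and stream forces torch to bind a
    # default context, which torch.cuda.init() alone may not do.
    torch.cuda.current_device()
    torch.cuda.current_stream()
    return True


print(ensure_torch_cuda_context())
```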
