Clarify extension and allocator docs

shi-eric · shi-eric · commit 398c8684bc64 · 2026-05-30T21:25:54.000Z
Several extension docs used imprecise terminology around diagnostics,
custom allocator scope, and native snippets. The allocator docs also
introduced RMM without expanding the acronym, and the native function
section only showed CUDA launches despite pure C++ snippets working on
CPU kernels.

Update the docs to distinguish environment diagnostics from internal
logging, spell out RAPIDS Memory Manager, describe allocator routing as
current support, and add a CUDA inline PTX example using vabsdiff4 for
packed-byte SAD. This keeps the examples practical while documenting the
boundary between CPU-compatible C++ snippets and CUDA-only native code.

Signed-off-by: Eric Shi <ershi@nvidia.com>
diff --git a/docs/deep_dive/allocators.rst b/docs/deep_dive/allocators.rst
@@ -246,7 +246,7 @@ Limitations
 Mempool-to-Mempool Copies Between GPUs During Graph Capture
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
-Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices.  Note that this only applies to capturing mempool-to-mempool copies in a graph; copies done outside of graph capture are not affected.  Copies within the same mempool (i.e., same device) are also not affected.
+Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices.  Note that this only applies to capturing mempool-to-mempool copies in a graph.  Copies done outside of graph capture are not affected.  Copies within the same mempool (i.e., same device) are also not affected.
 
 There are two workarounds.  If mempool access is supported, you can simply enable mempool access between the devices prior to graph capture, as shown in :ref:`mempool_access`.
 
@@ -274,11 +274,13 @@ Custom Allocators
 -----------------
 
 Warp supports pluggable memory allocators for CUDA devices. The public extension
-API is introduced in :doc:`../user_guide/extending_warp`; this section provides
-complete PyTorch and RMM allocator examples and allocator-specific caveats.
-Custom allocators only affect :class:`warp.array` allocations on CUDA devices;
-CPU allocations, pinned memory, and internal native allocations (e.g., BVH
-construction temporaries) are not affected.
+API is introduced in :doc:`../user_guide/extending_warp`. This section provides
+complete PyTorch and RAPIDS Memory Manager (RMM) allocator examples and
+allocator-specific caveats.
+Custom allocators currently affect :class:`warp.array` allocations on CUDA
+devices only. Custom allocator routing for CPU allocations, pinned memory, and
+internal native allocations (e.g., BVH construction temporaries) is not
+currently supported.
 
 Setting a Custom Allocator
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -398,11 +400,11 @@ PyTorch's cache, implement a small custom allocator that calls
 PyTorch tracks the device and stream for pointers returned by
 ``caching_allocator_alloc()``, so ``caching_allocator_delete()`` only needs the
 pointer. The ``_active_allocations`` dictionary above is for validation and
-debugging; applications can customize this tracking for their own accounting,
+debugging. Applications can customize this tracking for their own accounting,
 thread-safety, or distributed runtime needs.
 
-RMM Integration
-~~~~~~~~~~~~~~~
+RAPIDS Memory Manager (RMM) Integration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
 `RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`_ provides high-performance
 pooled allocators for CUDA. Warp includes a built-in adapter, :class:`~warp.utils.AllocatorRmm`, that
diff --git a/docs/user_guide/debugging.rst b/docs/user_guide/debugging.rst
@@ -78,7 +78,7 @@ non-differentiable.
 
     Reading or setting either deprecated flag emits a one-time
     ``DeprecationWarning``. During the deprecation window the flag is still
-    honored alongside ``log_level``, so existing code keeps working; remove the
+    honored alongside ``log_level``, so existing code keeps working. Remove the
     flag once your code sets ``log_level`` directly.
 
     ``wp.config.verbose_warnings`` is not deprecated. It is an orthogonal
@@ -97,15 +97,21 @@ This can be useful in identifying where a particular warning is being emitted fr
 Custom Loggers
 ^^^^^^^^^^^^^^
 
-By default Warp routes diagnostics through a built-in logger that writes
-debug and info messages to ``sys.stdout``, errors to ``sys.stderr``, and
-routes warnings through Python's :mod:`warnings` filter machinery so that
-``-W`` flags and :func:`warnings.simplefilter` work as expected.
-
-Applications and frameworks can install a custom logger to capture or redirect
-Warp diagnostics. See :doc:`extending_warp` for the :class:`wp.Logger
-<warp.Logger>` protocol, :func:`wp.set_logger() <warp.set_logger>`, and
-:class:`wp.ScopedLogger <warp.ScopedLogger>` examples.
+Warp's internals emit library messages for status, debugging, warnings,
+deprecations, and errors. These messages are separate from
+:func:`wp.print_diagnostics() <warp.print_diagnostics>`, which prints an
+explicit snapshot of the build and runtime environment.
+
+By default, Warp routes internal library messages through a built-in logger.
+The built-in logger writes debug and info messages to ``sys.stdout``, errors to
+``sys.stderr``, and warnings through Python's :mod:`warnings` filter machinery
+so that ``-W`` flags and :func:`warnings.simplefilter` work as expected.
+
+Applications and frameworks can install a custom logger to filter, capture, or
+redirect messages issued from Warp internals through their own logging systems.
+See :doc:`extending_warp` for the :class:`wp.Logger <warp.Logger>` protocol,
+:func:`wp.set_logger() <warp.set_logger>`, and :class:`wp.ScopedLogger
+<warp.ScopedLogger>` examples.
 
 .. _debug-mode:
 
diff --git a/docs/user_guide/extending_warp.rst b/docs/user_guide/extending_warp.rst
@@ -5,7 +5,7 @@ Extending Warp
 
 Warp exposes several public extension points for applications and frameworks
 that need behavior beyond the built-in kernel language, allocation policy, or
-diagnostics routing. These APIs are intended for power users embedding Warp into
+internal logging. These APIs are intended for power users embedding Warp into
 larger systems or reaching low-level C++/CUDA functionality directly.
 
 Native Functions
@@ -16,6 +16,11 @@ into generated Warp modules. Native functions are useful when Warp does not
 provide a built-in operation, CUDA intrinsic, synchronization pattern, or
 low-level expression that your kernel needs.
 
+Pure C++ snippets, meaning snippets without CUDA-only constructs, can be used
+by CPU kernels. The same snippet can also be used by CUDA kernels if the code is
+valid device code. CUDA-specific constructs, such as ``__shared__`` memory,
+``__syncthreads()``, and CUDA atomics, require CUDA kernels.
+
 The decorator takes native source code as a string. The decorated Python
 function is a typed stub: its arguments define the names and types available to
 the snippet, and its body should be ``...`` because Warp replaces the body with
@@ -44,20 +49,17 @@ snippets are inserted into generated C++/CUDA, so they cannot call
         increment(x, out, tid)
 
 
-    x = wp.array(np.arange(4, dtype=np.float32), dtype=wp.float32, device="cuda")
+    device = "cpu"
+    x = wp.array(np.arange(4, dtype=np.float32), dtype=wp.float32, device=device)
     out = wp.zeros_like(x)
-    wp.launch(increment_kernel, dim=x.shape, inputs=[x], outputs=[out], device="cuda")
-
-Pure C++ snippets can be used by CPU kernels. CUDA-specific constructs, such as
-``__shared__`` memory, ``__syncthreads()``, and CUDA atomics, require CUDA
-kernels.
+    wp.launch(increment_kernel, dim=x.shape, inputs=[x], outputs=[out], device=device)
 
 CUDA Shared Memory
 ~~~~~~~~~~~~~~~~~~
 
 Native snippets can use CUDA features that Warp does not expose directly. The
 following example performs a reduction within a single 128-thread block using
-shared memory. It assumes the launch uses exactly one block; generalizing this
+shared memory. It assumes the launch uses exactly one block. Generalizing this
 pattern to multiple blocks requires using a per-block thread index and storing
 one result per block.
 
@@ -100,6 +102,73 @@ one result per block.
     out = wp.zeros(1, dtype=wp.int32, device="cuda")
     wp.launch(reduce_kernel, dim=128, inputs=[arr], outputs=[out], block_dim=128, device="cuda")
 
+Inline PTX
+~~~~~~~~~~
+
+Native snippets can also use `inline Parallel Thread Execution (PTX) assembly
+<https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html>`_ inside CUDA
+code. Inline PTX is useful when you need a GPU instruction that is not exposed
+directly through Warp or CUDA C++.
+
+The following example computes a sum of absolute differences (SAD) for four
+packed 8-bit values at a time. This pattern appears in image matching, stereo
+vision, and descriptor comparisons, where adjacent grayscale pixels or feature
+bytes are often compared in small groups. The `PTX vabsdiff4 instruction
+<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#simd-video-instructions-vadd4-vsub4-vavrg4-vabsdiff4-vmin4-vmax4>`_
+performs four byte-wise absolute differences and, with the ``.add`` modifier,
+accumulates them into one 32-bit result.
+
+.. code-block:: python
+
+    import numpy as np
+    import warp as wp
+
+    snippet = r"""
+        unsigned int result;
+        unsigned int zero = 0;
+        asm("vabsdiff4.u32.u32.u32.add %0, %1, %2, %3;"
+            : "=r"(result)
+            : "r"(a), "r"(b), "r"(zero));
+        return result;
+        """
+
+
+    @wp.func_native(snippet)
+    def sad4_packed_u8(a: wp.uint32, b: wp.uint32) -> wp.uint32:
+        ...
+
+
+    @wp.kernel
+    def sad_kernel(
+        a: wp.array(dtype=wp.uint32),
+        b: wp.array(dtype=wp.uint32),
+        out: wp.array(dtype=wp.uint32),
+    ):
+        tid = wp.tid()
+        out[tid] = sad4_packed_u8(a[tid], b[tid])
+
+
+    def pack4(values):
+        return np.uint32(values[0] | (values[1] << 8) | (values[2] << 16) | (values[3] << 24))
+
+
+    a_host = np.array([pack4([10, 20, 30, 40]), pack4([0, 128, 255, 13])], dtype=np.uint32)
+    b_host = np.array([pack4([13, 18, 41, 35]), pack4([255, 120, 0, 15])], dtype=np.uint32)
+
+    a = wp.array(a_host, dtype=wp.uint32, device="cuda")
+    b = wp.array(b_host, dtype=wp.uint32, device="cuda")
+    out = wp.zeros_like(a)
+    wp.launch(sad_kernel, dim=a.shape, inputs=[a, b], outputs=[out], device="cuda")
+
+    # [3 + 2 + 11 + 5, 255 + 8 + 255 + 2]
+    np.testing.assert_array_equal(out.numpy(), np.array([21, 520], dtype=np.uint32))
+
+The ``"r"`` constraints bind the operands to 32-bit integer registers, which
+matches the ``.u32`` instruction operands. The final PTX operand is an
+accumulator and is supplied as a zero-initialized register in this example. If
+the assembly reads or writes memory through pointers, add the appropriate
+``"memory"`` clobber as described in NVIDIA's inline PTX documentation.
+
 Returning Values
 ~~~~~~~~~~~~~~~~
 
@@ -212,22 +281,23 @@ changes:
 
     wp.set_cuda_allocator(None)
 
-Custom allocators affect CUDA :class:`warp.array` allocations. They do not
-affect CPU allocations, pinned memory, external arrays, or internal native
-allocations such as BVH construction temporaries. Allocators that do not support
-stream-ordered allocation may not work correctly during CUDA graph capture.
+Custom allocators currently affect CUDA :class:`warp.array` allocations. Custom
+allocator routing for CPU allocations, pinned memory, external arrays, and
+internal native allocations such as BVH construction temporaries is not
+currently supported. Allocators that do not support stream-ordered allocation
+may not work correctly during CUDA graph capture.
 
-See :ref:`custom_allocators` for PyTorch and RMM allocator examples, and see
-:doc:`../deep_dive/allocators` for memory-pool behavior, graph-capture details,
-and multi-GPU allocator access.
+See :ref:`custom_allocators` for PyTorch and RAPIDS Memory Manager (RMM)
+allocator examples, and see :doc:`../deep_dive/allocators` for memory-pool
+behavior, graph-capture details, and multi-GPU allocator access.
 
 Custom Loggers
 --------------
 
-Applications can route Warp diagnostics through their own logging system by
-installing a custom logger. :class:`wp.Logger <warp.Logger>` is a
-runtime-checkable :class:`~typing.Protocol`, so the logger only needs to provide
-the expected methods.
+Applications can route messages issued from Warp internals through their own
+logging system by installing a custom logger. :class:`wp.Logger <warp.Logger>`
+is a runtime-checkable :class:`~typing.Protocol`, so the logger only needs to
+provide the expected methods.
 
 .. code-block:: python
 
@@ -245,7 +315,7 @@ arguments even if the adapter ignores them. Warp passes those arguments by name
 so that the built-in logger can integrate with Python's :mod:`warnings` filter
 machinery.
 
-To forward Warp diagnostics into Python's standard :mod:`logging` module, wrap a
+To forward these messages into Python's standard :mod:`logging` module, wrap a
 standard logger in a small adapter:
 
 .. code-block:: python
diff --git a/warp/_src/context.py b/warp/_src/context.py
@@ -1399,7 +1399,7 @@ def my_kernel_with_launch_bounds(a: wp.array[float]):
 
 
         @wp.kernel(module_options={"fast_math": True}, module="unique")
-        def my_kernel_fast(a: wp.array(dtype=float), b: wp.array(dtype=float)):
+        def my_kernel_fast(a: wp.array[float], b: wp.array[float]):
             # fast_math is a module-level option, so module="unique" is required
             tid = wp.tid()
             b[tid] = a[tid] + 1.0