Commit 398c868

Clarify extension and allocator docs
Several extension docs used imprecise terminology around diagnostics, custom allocator scope, and native snippets. The allocator docs also introduced RMM without expanding the acronym, and the native function section only showed CUDA launches despite pure C++ snippets working on CPU kernels.

Update the docs to distinguish environment diagnostics from internal logging, spell out RAPIDS Memory Manager, describe allocator routing as current support, and add a CUDA inline PTX example using vabsdiff4 for packed-byte SAD. This keeps the examples practical while documenting the boundary between CPU-compatible C++ snippets and CUDA-only native code.

Signed-off-by: Eric Shi <ershi@nvidia.com>
1 parent d06c2ad commit 398c868

4 files changed

Lines changed: 118 additions & 40 deletions

File tree

docs/deep_dive/allocators.rst

Lines changed: 11 additions & 9 deletions
@@ -246,7 +246,7 @@ Limitations
246246
Mempool-to-Mempool Copies Between GPUs During Graph Capture
247247
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
248248

249-
Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices. Note that this only applies to capturing mempool-to-mempool copies in a graph; copies done outside of graph capture are not affected. Copies within the same mempool (i.e., same device) are also not affected.
249+
Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices. Note that this only applies to capturing mempool-to-mempool copies in a graph. Copies done outside of graph capture are not affected. Copies within the same mempool (i.e., same device) are also not affected.
250250

251251
There are two workarounds. If mempool access is supported, you can simply enable mempool access between the devices prior to graph capture, as shown in :ref:`mempool_access`.
252252

@@ -274,11 +274,13 @@ Custom Allocators
274274
-----------------
275275

276276
Warp supports pluggable memory allocators for CUDA devices. The public extension
277-
API is introduced in :doc:`../user_guide/extending_warp`; this section provides
278-
complete PyTorch and RMM allocator examples and allocator-specific caveats.
279-
Custom allocators only affect :class:`warp.array` allocations on CUDA devices;
280-
CPU allocations, pinned memory, and internal native allocations (e.g., BVH
281-
construction temporaries) are not affected.
277+
API is introduced in :doc:`../user_guide/extending_warp`. This section provides
278+
complete PyTorch and RAPIDS Memory Manager (RMM) allocator examples and
279+
allocator-specific caveats.
280+
Custom allocators currently affect :class:`warp.array` allocations on CUDA
281+
devices only. Custom allocator routing for CPU allocations, pinned memory, and
282+
internal native allocations (e.g., BVH construction temporaries) is not
283+
currently supported.
282284

283285
Setting a Custom Allocator
284286
~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -398,11 +400,11 @@ PyTorch's cache, implement a small custom allocator that calls
398400
PyTorch tracks the device and stream for pointers returned by
399401
``caching_allocator_alloc()``, so ``caching_allocator_delete()`` only needs the
400402
pointer. The ``_active_allocations`` dictionary above is for validation and
401-
debugging; applications can customize this tracking for their own accounting,
403+
debugging. Applications can customize this tracking for their own accounting,
402404
thread-safety, or distributed runtime needs.
403405

404-
RMM Integration
405-
~~~~~~~~~~~~~~~
406+
RAPIDS Memory Manager (RMM) Integration
407+
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
406408

407409
`RAPIDS Memory Manager (RMM) <https://github.com/rapidsai/rmm>`_ provides high-performance
408410
pooled allocators for CUDA. Warp includes a built-in adapter, :class:`~warp.utils.AllocatorRmm`, that

docs/user_guide/debugging.rst

Lines changed: 16 additions & 10 deletions
@@ -78,7 +78,7 @@ non-differentiable.
7878

7979
Reading or setting either deprecated flag emits a one-time
8080
``DeprecationWarning``. During the deprecation window the flag is still
81-
honored alongside ``log_level``, so existing code keeps working; remove the
81+
honored alongside ``log_level``, so existing code keeps working. Remove the
8282
flag once your code sets ``log_level`` directly.
8383

8484
``wp.config.verbose_warnings`` is not deprecated. It is an orthogonal
@@ -97,15 +97,21 @@ This can be useful in identifying where a particular warning is being emitted fr
9797
Custom Loggers
9898
^^^^^^^^^^^^^^
9999

100-
By default Warp routes diagnostics through a built-in logger that writes
101-
debug and info messages to ``sys.stdout``, errors to ``sys.stderr``, and
102-
routes warnings through Python's :mod:`warnings` filter machinery so that
103-
``-W`` flags and :func:`warnings.simplefilter` work as expected.
104-
105-
Applications and frameworks can install a custom logger to capture or redirect
106-
Warp diagnostics. See :doc:`extending_warp` for the :class:`wp.Logger
107-
<warp.Logger>` protocol, :func:`wp.set_logger() <warp.set_logger>`, and
108-
:class:`wp.ScopedLogger <warp.ScopedLogger>` examples.
100+
Warp's internals emit library messages for status, debugging, warnings,
101+
deprecations, and errors. These messages are separate from
102+
:func:`wp.print_diagnostics() <warp.print_diagnostics>`, which prints an
103+
explicit snapshot of the build and runtime environment.
104+
105+
By default, Warp routes internal library messages through a built-in logger.
106+
The built-in logger writes debug and info messages to ``sys.stdout``, errors to
107+
``sys.stderr``, and warnings through Python's :mod:`warnings` filter machinery
108+
so that ``-W`` flags and :func:`warnings.simplefilter` work as expected.
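As a rough illustration of what "filter machinery" means here, the sketch below uses only the Python standard library. The ``emit_library_warning`` function is a hypothetical stand-in for a library message that reaches :mod:`warnings`; it is not a Warp API, and the sketch assumes such messages behave like ordinary ``warnings.warn()`` calls.

```python
import warnings


def emit_library_warning():
    # Hypothetical stand-in for a library message routed through warnings.warn().
    warnings.warn("example library warning", UserWarning)


# "always" shows every occurrence, including repeats from the same location.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    emit_library_warning()
    emit_library_warning()
always_count = len(caught)

# "ignore" suppresses the warning entirely.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")
    emit_library_warning()
ignore_count = len(caught)

# "error" turns the warning into an exception, which is useful in test suites.
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        emit_library_warning()
        raised = False
    except UserWarning:
        raised = True
```

The same filter actions can be selected from the command line with ``-W``, for example ``python -W error script.py``.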
109+
110+
Applications and frameworks can install a custom logger to filter, capture, or
111+
redirect messages issued from Warp internals through their own logging systems.
112+
See :doc:`extending_warp` for the :class:`wp.Logger <warp.Logger>` protocol,
113+
:func:`wp.set_logger() <warp.set_logger>`, and :class:`wp.ScopedLogger
114+
<warp.ScopedLogger>` examples.
109115

110116
.. _debug-mode:
111117

docs/user_guide/extending_warp.rst

Lines changed: 90 additions & 20 deletions
@@ -5,7 +5,7 @@ Extending Warp
55

66
Warp exposes several public extension points for applications and frameworks
77
that need behavior beyond the built-in kernel language, allocation policy, or
8-
diagnostics routing. These APIs are intended for power users embedding Warp into
8+
internal logging. These APIs are intended for power users embedding Warp into
99
larger systems or reaching low-level C++/CUDA functionality directly.
1010

1111
Native Functions
@@ -16,6 +16,11 @@ into generated Warp modules. Native functions are useful when Warp does not
1616
provide a built-in operation, CUDA intrinsic, synchronization pattern, or
1717
low-level expression that your kernel needs.
1818

19+
Pure C++ snippets, meaning snippets without CUDA-only constructs, can be used
20+
by CPU kernels. The same snippet can also be used by CUDA kernels if the code is
21+
valid device code. CUDA-specific constructs, such as ``__shared__`` memory,
22+
``__syncthreads()``, and CUDA atomics, require CUDA kernels.
23+
1924
The decorator takes native source code as a string. The decorated Python
2025
function is a typed stub: its arguments define the names and types available to
2126
the snippet, and its body should be ``...`` because Warp replaces the body with
@@ -44,20 +49,17 @@ snippets are inserted into generated C++/CUDA, so they cannot call
4449
increment(x, out, tid)
4550
4651
47-
x = wp.array(np.arange(4, dtype=np.float32), dtype=wp.float32, device="cuda")
52+
device = "cpu"
53+
x = wp.array(np.arange(4, dtype=np.float32), dtype=wp.float32, device=device)
4854
out = wp.zeros_like(x)
49-
wp.launch(increment_kernel, dim=x.shape, inputs=[x], outputs=[out], device="cuda")
50-
51-
Pure C++ snippets can be used by CPU kernels. CUDA-specific constructs, such as
52-
``__shared__`` memory, ``__syncthreads()``, and CUDA atomics, require CUDA
53-
kernels.
55+
wp.launch(increment_kernel, dim=x.shape, inputs=[x], outputs=[out], device=device)
5456
5557
CUDA Shared Memory
5658
~~~~~~~~~~~~~~~~~~
5759

5860
Native snippets can use CUDA features that Warp does not expose directly. The
5961
following example performs a reduction within a single 128-thread block using
60-
shared memory. It assumes the launch uses exactly one block; generalizing this
62+
shared memory. It assumes the launch uses exactly one block. Generalizing this
6163
pattern to multiple blocks requires using a per-block thread index and storing
6264
one result per block.
6365

@@ -100,6 +102,73 @@ one result per block.
100102
out = wp.zeros(1, dtype=wp.int32, device="cuda")
101103
wp.launch(reduce_kernel, dim=128, inputs=[arr], outputs=[out], block_dim=128, device="cuda")
102104
105+
Inline PTX
106+
~~~~~~~~~~
107+
108+
Native snippets can also use `inline Parallel Thread Execution (PTX) assembly
109+
<https://docs.nvidia.com/cuda/inline-ptx-assembly/index.html>`_ inside CUDA
110+
code. Inline PTX is useful when you need a GPU instruction that is not exposed
111+
directly through Warp or CUDA C++.
112+
113+
The following example computes a sum of absolute differences (SAD) for four
114+
packed 8-bit values at a time. This pattern appears in image matching, stereo
115+
vision, and descriptor comparisons, where adjacent grayscale pixels or feature
116+
bytes are often compared in small groups. The `PTX vabsdiff4 instruction
117+
<https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#simd-video-instructions-vadd4-vsub4-vavrg4-vabsdiff4-vmin4-vmax4>`_
118+
performs four byte-wise absolute differences and, with the ``.add`` modifier,
119+
accumulates them into one 32-bit result.
120+
121+
.. code-block:: python
122+
123+
import numpy as np
124+
import warp as wp
125+
126+
snippet = r"""
127+
unsigned int result;
128+
unsigned int zero = 0;
129+
asm("vabsdiff4.u32.u32.u32.add %0, %1, %2, %3;"
130+
: "=r"(result)
131+
: "r"(a), "r"(b), "r"(zero));
132+
return result;
133+
"""
134+
135+
136+
@wp.func_native(snippet)
137+
def sad4_packed_u8(a: wp.uint32, b: wp.uint32) -> wp.uint32:
138+
...
139+
140+
141+
@wp.kernel
142+
def sad_kernel(
143+
a: wp.array(dtype=wp.uint32),
144+
b: wp.array(dtype=wp.uint32),
145+
out: wp.array(dtype=wp.uint32),
146+
):
147+
tid = wp.tid()
148+
out[tid] = sad4_packed_u8(a[tid], b[tid])
149+
150+
151+
def pack4(values):
152+
return np.uint32(values[0] | (values[1] << 8) | (values[2] << 16) | (values[3] << 24))
153+
154+
155+
a_host = np.array([pack4([10, 20, 30, 40]), pack4([0, 128, 255, 13])], dtype=np.uint32)
156+
b_host = np.array([pack4([13, 18, 41, 35]), pack4([255, 120, 0, 15])], dtype=np.uint32)
157+
158+
a = wp.array(a_host, dtype=wp.uint32, device="cuda")
159+
b = wp.array(b_host, dtype=wp.uint32, device="cuda")
160+
out = wp.zeros_like(a)
161+
wp.launch(sad_kernel, dim=a.shape, inputs=[a, b], outputs=[out], device="cuda")
162+
163+
# [3 + 2 + 11 + 5, 255 + 8 + 255 + 2]
164+
np.testing.assert_array_equal(out.numpy(), np.array([21, 520], dtype=np.uint32))
165+
166+
The ``"r"`` constraints bind the operands to 32-bit integer registers, which
167+
matches the ``.u32`` instruction operands. The final PTX operand is an
168+
accumulator and is supplied as a zero-initialized register in this example. If
169+
the assembly reads or writes memory through pointers, add the appropriate
170+
``"memory"`` clobber as described in NVIDIA's inline PTX documentation.
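The expected values in the example can be cross-checked on the host without a GPU. The following is a minimal NumPy sketch, not part of Warp: ``sad4_reference`` is a hypothetical helper name, and it assumes the same packing as ``pack4`` above (four 8-bit values per 32-bit word).

```python
import numpy as np


def pack4(values):
    # Pack four 8-bit values into one 32-bit word, lowest byte first.
    return np.uint32(values[0] | (values[1] << 8) | (values[2] << 16) | (values[3] << 24))


def sad4_reference(a, b):
    # Hypothetical host-side reference: reinterpret each 32-bit word as four
    # bytes, widen to a signed type so the subtraction cannot wrap, then sum
    # the absolute differences per word.
    a_bytes = np.ascontiguousarray(a, dtype=np.uint32).view(np.uint8).reshape(-1, 4).astype(np.int32)
    b_bytes = np.ascontiguousarray(b, dtype=np.uint32).view(np.uint8).reshape(-1, 4).astype(np.int32)
    return np.abs(a_bytes - b_bytes).sum(axis=1).astype(np.uint32)


a_host = np.array([pack4([10, 20, 30, 40]), pack4([0, 128, 255, 13])], dtype=np.uint32)
b_host = np.array([pack4([13, 18, 41, 35]), pack4([255, 120, 0, 15])], dtype=np.uint32)

expected = sad4_reference(a_host, b_host)
```

Because the sum runs over all four bytes of each word, the result does not depend on host byte order. A reference like this can be compared against ``out.numpy()`` when testing the CUDA path.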
171+
103172
Returning Values
104173
~~~~~~~~~~~~~~~~
105174

@@ -212,22 +281,23 @@ changes:
212281
213282
wp.set_cuda_allocator(None)
214283
215-
Custom allocators affect CUDA :class:`warp.array` allocations. They do not
216-
affect CPU allocations, pinned memory, external arrays, or internal native
217-
allocations such as BVH construction temporaries. Allocators that do not support
218-
stream-ordered allocation may not work correctly during CUDA graph capture.
284+
Custom allocators currently affect CUDA :class:`warp.array` allocations. Custom
285+
allocator routing for CPU allocations, pinned memory, external arrays, and
286+
internal native allocations such as BVH construction temporaries is not
287+
currently supported. Allocators that do not support stream-ordered allocation
288+
may not work correctly during CUDA graph capture.
219289

220-
See :ref:`custom_allocators` for PyTorch and RMM allocator examples, and see
221-
:doc:`../deep_dive/allocators` for memory-pool behavior, graph-capture details,
222-
and multi-GPU allocator access.
290+
See :ref:`custom_allocators` for PyTorch and RAPIDS Memory Manager (RMM)
291+
allocator examples, and see :doc:`../deep_dive/allocators` for memory-pool
292+
behavior, graph-capture details, and multi-GPU allocator access.
223293

224294
Custom Loggers
225295
--------------
226296

227-
Applications can route Warp diagnostics through their own logging system by
228-
installing a custom logger. :class:`wp.Logger <warp.Logger>` is a
229-
runtime-checkable :class:`~typing.Protocol`, so the logger only needs to provide
230-
the expected methods.
297+
Applications can route messages issued from Warp internals through their own
298+
logging system by installing a custom logger. :class:`wp.Logger <warp.Logger>`
299+
is a runtime-checkable :class:`~typing.Protocol`, so the logger only needs to
300+
provide the expected methods.
231301

232302
.. code-block:: python
233303
@@ -245,7 +315,7 @@ arguments even if the adapter ignores them. Warp passes those arguments by name
245315
so that the built-in logger can integrate with Python's :mod:`warnings` filter
246316
machinery.
247317

248-
To forward Warp diagnostics into Python's standard :mod:`logging` module, wrap a
318+
To forward these messages into Python's standard :mod:`logging` module, wrap a
249319
standard logger in a small adapter:
250320

251321
.. code-block:: python

warp/_src/context.py

Lines changed: 1 addition & 1 deletion
@@ -1399,7 +1399,7 @@ def my_kernel_with_launch_bounds(a: wp.array[float]):
13991399

14001400

14011401
@wp.kernel(module_options={"fast_math": True}, module="unique")
1402-
def my_kernel_fast(a: wp.array(dtype=float), b: wp.array(dtype=float)):
1402+
def my_kernel_fast(a: wp.array[float], b: wp.array[float]):
14031403
# fast_math is a module-level option, so module="unique" is required
14041404
tid = wp.tid()
14051405
b[tid] = a[tid] + 1.0
