Several extension docs used imprecise terminology around diagnostics,
custom allocator scope, and native snippets. The allocator docs also
introduced RMM without expanding the acronym, and the native function
section only showed CUDA launches despite pure C++ snippets working on
CPU kernels.

Update the docs to distinguish environment diagnostics from internal
logging, spell out RAPIDS Memory Manager, describe allocator routing as
current support, and add a CUDA inline PTX example using vabsdiff4 for
packed-byte SAD. This keeps the examples practical while documenting the
boundary between CPU-compatible C++ snippets and CUDA-only native code.

Signed-off-by: Eric Shi <ershi@nvidia.com>
-Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices. Note that this only applies to capturing mempool-to-mempool copies in a graph; copies done outside of graph capture are not affected. Copies within the same mempool (i.e., same device) are also not affected.
+Copying data between different GPUs will fail during graph capture if the source and destination are allocated using mempool allocators and mempool access is not enabled between devices. Note that this only applies to capturing mempool-to-mempool copies in a graph. Copies done outside of graph capture are not affected. Copies within the same mempool (i.e., same device) are also not affected.

 There are two workarounds. If mempool access is supported, you can simply enable mempool access between the devices prior to graph capture, as shown in :ref:`mempool_access`.

@@ -274,11 +274,13 @@ Custom Allocators
 -----------------

 Warp supports pluggable memory allocators for CUDA devices. The public extension
-API is introduced in :doc:`../user_guide/extending_warp`; this section provides
-complete PyTorch and RMM allocator examples and allocator-specific caveats.
-Custom allocators only affect :class:`warp.array` allocations on CUDA devices;
-CPU allocations, pinned memory, and internal native allocations (e.g., BVH
-construction temporaries) are not affected.
+API is introduced in :doc:`../user_guide/extending_warp`. This section provides
+complete PyTorch and RAPIDS Memory Manager (RMM) allocator examples and
+allocator-specific caveats.
+Custom allocators currently affect :class:`warp.array` allocations on CUDA
+devices only. Custom allocator routing for CPU allocations, pinned memory, and
+internal native allocations (e.g., BVH construction temporaries) is not
+currently supported.

 Setting a Custom Allocator
 ~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -398,11 +400,11 @@ PyTorch's cache, implement a small custom allocator that calls
 PyTorch tracks the device and stream for pointers returned by
 ``caching_allocator_alloc()``, so ``caching_allocator_delete()`` only needs the
 pointer. The ``_active_allocations`` dictionary above is for validation and
-debugging; applications can customize this tracking for their own accounting,
+debugging. Applications can customize this tracking for their own accounting,