diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 4 additions & 0 deletions
diff --git a/design/hardware-coherent-memory-access.md b/design/hardware-coherent-memory-access.md
Lines changed: 338 additions & 76 deletions
diff --git a/design/pluggable-allocators.md b/design/pluggable-allocators.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/api_reference/warp.rst b/docs/api_reference/warp.rst
Lines changed: 2 additions & 0 deletions
diff --git a/docs/deep_dive/allocators.rst b/docs/deep_dive/allocators.rst
Lines changed: 121 additions & 0 deletions
@@ -12,6 +12,10 @@
 - Extend AddressSanitizer support to JIT-compiled CPU kernels: when `warp-clang` is built with `--sanitize=address`, CPU
   kernels are automatically instrumented and share the host's single in-process ASan runtime, so out-of-bounds accesses
   into a `wp.array` are reported as `heap-buffer-overflow` ([GH-1387](https://github.com/NVIDIA/warp/issues/1387)).
+- Add `wp.ManagedAllocator()` for explicit CUDA managed-memory arrays. CPU kernels can use managed arrays as an
+  opt-in path to read and write CUDA data directly on systems where CUDA reports compatible managed-memory access,
+  while ordinary Warp CUDA arrays still need explicit CPU copies. Preallocated managed arrays work in CUDA graph
+  captures, but capture-time allocation is not currently supported ([GH-1523](https://github.com/NVIDIA/warp/issues/1523)).
 
 ### Removed
 
 
@@ -231,7 +231,7 @@ Future solutions must provide enough allocation provenance for
 make the same conservative decisions they make for Warp-owned allocations. At a
 minimum, Warp needs to distinguish the owning device and memory class for
 allocations that participate in cross-device launch verification, including
-default CUDA device memory, CUDA memory pools, managed memory, pinned host
+CUDA malloc memory, CUDA memory pools, managed memory, pinned host
 memory, and allocator-defined external memory.
 
 Any future mechanism must remain backward compatible with simple custom
 
@@ -384,7 +384,9 @@ CUDA Memory Management
    :nosignatures:
    :toctree: _generated
 
+   AllocationKind
    Allocator
+   ManagedAllocator
    ScopedAllocator
    ScopedMempool
    ScopedMempoolAccess
 
@@ -311,6 +311,127 @@ For temporary allocator changes, use the :class:`ScopedAllocator` context manage
         a = wp.zeros(1000, dtype=wp.float32, device="cuda:0")
     # Original allocator is restored here
 
+.. _managed_memory_allocation_options:
+
+Managed Memory Allocator
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Managed memory is CUDA-managed storage that can be addressed from CPU and GPU
+code. CUDA Unified Memory manages page placement and migration, so pages may move
+between CPU and GPU memory as different processors touch them. Pinned CPU
+memory, by contrast, remains host memory that a GPU may access through a host
+mapping. The table below compares managed memory with Warp's other allocation
+options:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 18 29 27 26
+
+   * - Allocation option
+     - Residency and migration
+     - CPU/GPU access
+     - Typical use
+   * - Default CUDA
+     - Device memory with no automatic CPU/GPU migration.
+     - CUDA kernels access it directly; CPU code uses explicit copies.
+     - General GPU arrays when CPU access is staged explicitly.
+   * - CUDA mempool
+     - Device memory from CUDA's stream-ordered pool, with no automatic CPU/GPU
+       migration.
+     - Same CPU/GPU access rules as default CUDA memory, with separate
+       memory-pool access controls for peer GPUs.
+     - Faster repeated CUDA allocations and graph-captured allocation when
+       supported.
+   * - Pinned CPU
+     - Page-locked host memory that never migrates into device memory.
+     - CPU code accesses it directly; CUDA devices with unified virtual
+       addressing can access it through a host mapping.
+     - Asynchronous CPU/GPU copies or zero-copy access to small host-resident
+       data.
+   * - CUDA managed
+     - CUDA Unified Memory whose pages may migrate between CPU and GPU memory.
+     - CPU and GPU access follow CUDA managed-memory support and synchronization
+       rules.
+     - Sharing data across CPU/GPU code when migration is preferable to manual
+       copies.
+
+:class:`ManagedAllocator` creates CUDA managed-memory arrays through Warp's
+allocator interface. Managed arrays keep their CUDA device metadata, but
+``wp.can_access()`` and checked launch validation use CUDA managed-memory access
+rules for them instead of peer-access or memory-pool-access rules.
+
+One major reason to choose this allocator is sharing work between CPU and GPU
+code. On systems where CUDA reports compatible managed-memory access, CPU
+kernels can read and write managed CUDA arrays directly instead of maintaining
+a separate CPU copy. Standard Warp CUDA arrays remain non-managed and still
+require explicit copies before CPU code accesses them.
+
+The allocator object is not bound to one CUDA device and can be constructed
+before choosing a CUDA device. Warp invokes it under the target device's CUDA
+context, which must support CUDA managed memory, and records that context as
+the owner for each pointer:
+
+.. code:: python
+
+    managed = wp.ManagedAllocator()
+    device = wp.get_device("cuda:0")
+
+    with wp.ScopedAllocator(device, managed):
+        a = wp.zeros(1000, dtype=wp.float32, device=device)
+
+Constructing a :class:`ManagedAllocator` does not promise that pages initially
+reside in any device's physical memory, and it does not bypass the device's
+managed-memory capability check. The CUDA device used for each allocation
+determines the owner context and the array's device metadata; CUDA Unified
+Memory manages physical placement and migration.
+
+Use :attr:`array.allocation_kind <warp.array.allocation_kind>` to inspect Warp's
+verified allocation provenance:
+
+.. code:: python
+
+    if a.allocation_kind is wp.AllocationKind.CUDA_MANAGED:
+        ...
+
+The allocation kind describes how Warp believes the storage was allocated. It
+does not describe the current physical residency of CUDA managed memory, and
+views report the allocation kind of their owner array.
+
+To make managed memory the persistent default on all CUDA devices, install one
+allocator instance with :func:`set_cuda_allocator`:
+
+.. code:: python
+
+    managed = wp.ManagedAllocator()
+    wp.set_cuda_allocator(managed)
+
+If only some CUDA devices should use managed memory, install the same allocator
+with :func:`set_device_allocator` on those devices. A single allocator instance
+can serve multiple CUDA devices, but allocation fails clearly on any target
+device that does not report CUDA managed-memory support.
+
+Direct calls to ``ManagedAllocator.allocate()`` require an active CUDA context.
+Array factory functions such as :func:`zeros` and :func:`empty` pass the target
+device context automatically and perform the same managed-memory support check.
+
+Managed allocations currently have a CUDA graph-capture limitation in Warp:
+:class:`ManagedAllocator` does not allocate a new array while CUDA graph capture
+is active. If you need managed arrays with CUDA graphs, allocate them before
+capture begins and reuse the existing arrays inside the captured work. This is
+an implementation limitation, not a restriction on using pre-existing managed
+arrays in captured work.
+
+CPU access to managed arrays is hardware-dependent. Use :func:`can_access` to
+check a specific managed array before CPU code reads or writes it directly:
+
+.. code:: python
+
+    if wp.can_access("cpu", a):
+        wp.launch(cpu_kernel, dim=a.size, inputs=[a], device="cpu")
+    else:
+        a_cpu = a.to("cpu")  # explicit copy; kernel writes land in a_cpu, not a
+        wp.launch(cpu_kernel, dim=a_cpu.size, inputs=[a_cpu], device="cpu")
+
 Writing a Custom Allocator
 ~~~~~~~~~~~~~~~~~~~~~~~~~~