Warp v1.13.0

Warp v1.13 introduces experimental graph capture serialization with CPU replay, letting captured simulations round-trip through a portable .wrp file and load from standalone C++ on either GPU or CPU. It also adds an experimental cuBQL BVH backend for wp.Mesh that accelerates ray-heavy mesh queries, the wp.bfloat16 scalar type, a pluggable CUDA allocator interface with built-in RAPIDS Memory Manager (RMM) integration, scoped memory tracking with C++-layer call-site attribution, and a batch of new tile primitives (tile_dot, tile_axpy, tile_stack, scatter helpers).

New features

Graph capture serialization and CPU replay

Important

This is an experimental feature. The API may change without a formal deprecation cycle.

Warp v1.13 introduces a portable serialized-graph format. Operations recorded during wp.capture_begin(apic=True) / wp.capture_end() can be saved to a .wrp file with wp.capture_save() and replayed either from Python via wp.capture_load() or from standalone C++ through the native APIC functions, enabling cross-process and cross-language graph reuse (#1349). CPU graph capture is also new in this release: the same wp.Graph object now replays on CPU through wp.capture_launch(), and the underlying APIC operation log is what gets serialized. A new wp.handle (a uint64 alias) carries wp.Mesh handles across save and load so kernels can keep referencing meshes after deserialization.

import warp as wp

with wp.ScopedDevice("cpu"):
    a = wp.zeros(64, dtype=float)
    b = wp.zeros(64, dtype=float)

    wp.capture_begin(apic=True)
    wp.copy(b, a)
    graph = wp.capture_end()

    wp.capture_save(graph, "demo", inputs={"a": a}, outputs={"b": b})

# Later (in the same process or a fresh one): replay from disk.
with wp.ScopedDevice("cpu"):
    loaded = wp.capture_load("demo")
    loaded.set_param("a", wp.array([1.0] * 64, dtype=float))
    wp.capture_launch(loaded)

Loading and replaying from standalone C++ (CPU device shown). The full example also walks the _modules/ directory, loads each .o via wp_load_obj, resolves kernel symbols, and registers them with wp_apic_register_loaded_cpu_kernel before the first replay. The snippet below elides that boilerplate:

#include "apic.h"
#include "warp.h"

wp_init(nullptr);
APICGraph graph = wp_apic_load_graph(nullptr, "demo.wrp", 1);  // 1 = CPU device

// (Walk demo_modules/, load each .o, and register kernels. See linked example.)

wp_apic_set_param(graph, "a", a_buffer, a_size);
wp_apic_cpu_replay_graph(graph);              // For CUDA: cudaGraphLaunch(wp_apic_get_cuda_graph_exec(graph), stream)
wp_apic_get_param(graph, "b", b_buffer, b_size);

wp_apic_destroy_graph(graph);

See warp/examples/cpp/02_apic_visualization (CUDA replay) and warp/examples/cpp/03_apic_visualization_cpu (CPU replay) for end-to-end demos with OpenGL visualization.

What gets written:

demo.wrp                  # operation byte stream + region snapshots + metadata
demo_modules/
    <hash>.cubin / .meta  # one per CUDA kernel module, arch-pinned
    <hash>.o              # one per CPU kernel module (CPU capture)

Key capabilities:

wp.capture_save(graph, path, inputs=..., outputs=...) registers named bindings so the consumer side can swap in fresh inputs and read outputs by name without touching the graph topology.
wp.capture_load() + wp.capture_launch() support replay on both CPU and CUDA. Loaded graphs expose set_param, get_param, and get_param_ptr for each registered binding, plus params and is_loaded properties on wp.Graph.
wp.handle scalar type and wp.Mesh remap let kernels accept mesh handles whose underlying objects are reconstructed on load. APIC walks @wp.struct fields recursively to find handle pointers and remap them.

Stability and known gaps:

API Capture is experimental, and we plan to keep adding capabilities and closing gaps over future releases (tracker: #1388). For now, regenerate .wrp artifacts when upgrading Warp. The current operation set, handle types, and platform constraints are documented in the Graphs section of the user guide.

cuBQL BVH backend for `wp.Mesh`

Important

This is an experimental feature. The API may change without a formal deprecation cycle.

wp.Mesh now accepts bvh_constructor="cubql" to build its acceleration structure with cuBQL, an Apache 2.0-licensed header-only CUDA library for fast BVH construction and traversal (#1286). For ray-heavy workloads on dense static meshes, where the existing SAH builder's exhaustive construction dominates setup time and where ray traversal sits on the simulation hot path, cuBQL typically delivers faster ray queries alongside consistently lower build times than the SAH, median, and LBVH builders. As one specific data point, a Warp-based renderer benchmark on an RTX 4090 (Franka Emika Panda visual mesh, 8192 parallel worlds) saw simulation time drop from 1.41 s to 0.98 s (roughly a 1.4x speedup) after switching the constructor with no other changes. Speedups depend heavily on mesh size, query mix, and how much of the frame the mesh queries occupy, so benchmark on your own scene before relying on a particular win.

The "cubql" backend currently only routes wp.mesh_query_ray() through cuBQL's traversal kernels. Extending it to point queries, AABB queries, grouped queries, and winding-number support is future work. Today, passing groups=... or support_winding_number=True to a cuBQL wp.Mesh raises a RuntimeError at construction. Calling wp.mesh_query_point_* or wp.mesh_query_aabb_* against a cuBQL mesh silently returns no results. Stick with the default SAH/median/LBVH builders for kernels that mix query types or aren't ray-bound.

import warp as wp

mesh = wp.Mesh(
    points=points,           # wp.array of wp.vec3
    indices=tri_indices,     # wp.array of wp.int32, shape (num_tris * 3,)
    bvh_constructor="cubql",
)
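
The restrictions described above can be folded into a small selection helper so that call sites do not hard-code the backend. This is a minimal sketch, not part of Warp: the function name and its flags are hypothetical, and "sah" is assumed here as the fallback constructor string.

```python
def choose_bvh_constructor(
    ray_queries_only: bool,
    uses_groups: bool = False,
    needs_winding_number: bool = False,
) -> str:
    """Pick a bvh_constructor value that respects the cuBQL limits listed above."""
    if ray_queries_only and not uses_groups and not needs_winding_number:
        # Only wp.mesh_query_ray() is routed through cuBQL in this release.
        return "cubql"
    # Mixed query types, groups, or winding numbers: stay on a classic builder.
    return "sah"


constructor = choose_bvh_constructor(ray_queries_only=True)
print(constructor)
```

The returned string can then be passed as wp.Mesh(..., bvh_constructor=constructor).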

Pluggable CUDA allocator and RMM integration

CUDA device-memory allocations can now be routed through any object implementing the wp.Allocator protocol via wp.set_cuda_allocator(), wp.set_device_allocator(), or scoped with wp.ScopedAllocator (#781). The built-in wp.utils.AllocatorRmm delegates to RAPIDS Memory Manager so Warp can share a memory pool with PyTorch, CuPy, or any other RMM-aware framework, eliminating duplicate caching when training and simulation share a GPU in the same process.

import rmm
import warp as wp

rmm.reinitialize(pool_allocator=True, initial_pool_size=2**30)
wp.set_cuda_allocator(wp.utils.AllocatorRmm())

a = wp.zeros(1024, dtype=wp.float32, device="cuda:0")  # served from the RMM pool

wp.utils.AllocatorRmm requires the rmm package on Linux (pip install rmm-cu12).

Memory tracking

wp.ScopedMemoryTracker and the wp.config.track_memory flag enable allocation tracking with call-site attribution and per-category reports across GPU, host, and pinned-host memory (#1269). Tracking is implemented in the C++ native layer by intercepting all wp_alloc_* / wp_free_* calls, so internal allocations from BVH, hash-grid, mesh, volume, and sparse subsystems show up alongside Python-originated arrays, labeled with their subsystem (e.g. (native:bvh), (native:hashgrid), (native:mesh)).

import warp as wp


@wp.kernel
def fill(x: wp.array[float]):
    i = wp.tid()
    x[i] = float(i)


with wp.ScopedMemoryTracker("training step"):
    a = wp.zeros(1_000_000, dtype=wp.float32, device="cuda:0")
    b = wp.zeros(2_000_000, dtype=wp.float32, device="cuda:0")
    wp.launch(fill, dim=a.size, inputs=[a])
    wp.launch(fill, dim=b.size, inputs=[b])

Output:

Allocation Tracking Report
  Total allocations:
    cuda:0           2 (11.44 MB)
  Peak usage:
    cuda:0           11.44 MB
  Live allocations:
    (none)

Nest trackers to form hierarchical scopes (e.g. "simulation/collision"), and pass report_func=... to redirect the report to a logger or test-assert callback instead of stdout. For unscoped tracking across an entire process, set wp.config.track_memory = True before wp.init() and call wp.print_memory_report() at any time.
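
The 11.44 MB figure in the report above can be reproduced by hand from the two allocations. The sketch below is plain Python arithmetic, not Warp code, and it assumes the report counts 1 MB as 2**20 bytes. That unit is inferred from the numbers shown (decimal megabytes would give 12.00), not taken from the tracker's source.

```python
FLOAT32_BYTES = 4


def report_megabytes(num_bytes: int) -> float:
    """Convert a byte count to MB, assuming 1 MB = 2**20 bytes."""
    return round(num_bytes / 2**20, 2)


a_bytes = 1_000_000 * FLOAT32_BYTES  # array `a` in the example above
b_bytes = 2_000_000 * FLOAT32_BYTES  # array `b` in the example above
total_mb = report_megabytes(a_bytes + b_bytes)
print(total_mb)
```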

`wp.bfloat16` scalar data type

wp.bfloat16 joins wp.float16, wp.float32, and wp.float64 as a first-class Warp scalar type (#1332). It supports array allocation, kernel execution, autodiff, DLPack, PyTorch (wp.from_torch(t) for torch.bfloat16 tensors), JAX, and optional NumPy interop via the ml_dtypes package. wp.tile_matmul() and atomic operations also accept bfloat16 tiles, so transformer-style mixed-precision kernels can stay in Warp end to end.

import warp as wp


@wp.kernel
def axpy(a: wp.array[wp.bfloat16], b: wp.array[wp.bfloat16], out: wp.array[wp.bfloat16]):
    i = wp.tid()
    out[i] = a[i] * wp.bfloat16(2.0) + b[i]


a = wp.array([1.0, 2.0, 3.0], dtype=wp.bfloat16)
b = wp.array([0.5, 0.5, 0.5], dtype=wp.bfloat16)
out = wp.zeros_like(a)
wp.launch(axpy, dim=a.size, inputs=[a, b], outputs=[out])
print(out.numpy())  # [2.5 4.5 6.5]

Note

Long chains of bfloat16 math accumulate rounding error at bfloat16 precision because each intermediate result is quantized back to 16 bits before feeding into the next op. This is true whether the underlying op runs on native bf16 hardware (Warp dispatches +, -, * to PTX bf16 instructions on sm_80 and newer GPUs) or goes through a float32 round-trip (/, comparisons, and math built-ins like wp.sqrt / wp.exp, matching how PyTorch, JAX, and CuPy handle those ops). For precision-sensitive code, cast to wp.float32 at the start of the chain and back to wp.bfloat16 only at the boundary so intermediates stay in float32.
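
The effect described in the note can be reproduced without a GPU. The following is a plain-Python emulation, not Warp code: to_bf16 is a hypothetical helper that rounds a float32 bit pattern to its top 16 bits (round to nearest, ties to even), and a Python float stands in for the float32 accumulator.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even). NaN/inf handling is omitted."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; the bias turns truncation into rounding.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = ((bits + bias) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded & 0xFFFFFFFF))[0]


def chained_bf16_sum(values):
    """Quantize every intermediate sum back to bfloat16, as a bf16-only chain would."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


def wide_accumulator_sum(values):
    """Accumulate in a wider type and cast to bfloat16 once, at the boundary."""
    acc = 0.0
    for v in values:
        acc += to_bf16(v)
    return to_bf16(acc)


values = [0.01] * 1000  # the exact sum is 10.0
print(chained_bf16_sum(values), wide_accumulator_sum(values))
```

In this emulation the chained sum stalls far short of 10, because once the accumulator grows large enough, each addend is smaller than half the spacing between adjacent bfloat16 values and rounds away. The wide accumulator lands on 10 to within bfloat16 precision.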

Zero-copy texture interop

wp.Texture1D, wp.Texture2D, and wp.Texture3D accept an externally allocated CUDA array via the new cuda_array= parameter, sharing memory without an extra copy (#1238). For OpenGL interop, the new wp.GLTextureResource class registers an OpenGL texture for use as a Warp texture, so rendering pipelines can sample from textures Warp writes (and vice versa) without going through host memory. Texture objects also gain copy_from() and copy_to() methods for transfers between textures, host arrays, and device arrays. The previous copy_from_array() / copy_to_array() methods are now deprecated.

import warp as wp

# Wrap an existing CUDA array (cuArrayHandle as Python int) in a Warp texture.
tex = wp.Texture2D(
    cuda_array=external_cuda_array,
    width=1024,
    height=1024,
    channels=4,
    dtype=wp.float32,
    filter_mode=wp.Texture.FILTER_LINEAR,
)

# Round-trip into a Warp array without intermediate host buffers.
out = wp.zeros((1024, 1024, 4), dtype=wp.float32, device="cuda:0")
tex.copy_to(out)

Tile programming enhancements

New tile primitives

This release adds several tile primitives that shorten common kernel patterns:

wp.tile_dot(a, b) (#1364) computes the dot product of two same-shape tiles, returning a single-element tile of the underlying scalar type. For tiles of vectors or matrices, each element pair is fully contracted (e.g. wp.dot(a[i], b[i]) for tiles of vec3f).
wp.tile_axpy(alpha, src, dest) (#1363) performs the fused in-place update dest += alpha * src without allocating an intermediate scaled tile.
wp.tile_scatter_add(a, i, value, has_value, atomic=True) (#1342) is a cooperative scatter-add into a shared-memory tile. Set atomic=False for faster writes when indices are guaranteed unique across threads.
wp.tile_scatter_masked() (#1298) writes per-thread values into a shared-memory tile with cooperative synchronization, so each lane can publish a value at its own index without manual barriers.
wp.tile_query_valid() (#1335) exposes a cleaner loop condition for tile BVH and mesh AABB queries that avoids the wp.tile_max() reduction overhead the previous pattern required.

import warp as wp


@wp.kernel
def reduce_dot():
    a = wp.tile_ones(dtype=float, shape=64)
    b = wp.tile_ones(dtype=float, shape=64) * 2.0
    d = wp.tile_dot(a, b)
    print(d)


wp.launch_tiled(reduce_dot, dim=[1], inputs=[], block_dim=64)

Output:

[128] = tile(shape=(1), storage=register)
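
For intuition about wp.tile_scatter_add(), a sequential reference of the operation is sketched below in plain Python. It is not Warp code: the function name is hypothetical, each list position plays the role of one thread, and the semantics (tile[i] += value for threads whose has_value is true) are assumed from the description above.

```python
def scatter_add_reference(tile, indices, values, has_value):
    """Sequential model: every active lane adds its value at its own index."""
    result = list(tile)
    for idx, val, active in zip(indices, values, has_value):
        if active:
            result[idx] += val
    return result


tile = [0.0] * 4
indices = [0, 2, 2, 3, 0, 1]            # lanes 1 and 2 collide on index 2, lanes 0 and 4 on index 0
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
has_value = [True, True, True, False, True, True]  # lane 3 opts out

result = scatter_add_reference(tile, indices, values, has_value)
print(result)
```

Because two pairs of lanes target the same slot here, the equivalent Warp call would need atomic=True. The atomic=False fast path only matches this reference when the active indices are unique.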

Cooperative thread-block stack

wp.tile_stack() allocates a cooperative thread-block stack in shared memory, with wp.tile_stack_push(), wp.tile_stack_pop(), wp.tile_stack_clear(), and wp.tile_stack_count() operations (#1287). The stack lets all threads in a block contribute or consume entries without manual atomic-counter management. wp.tile_stack_push() accepts a has_value flag so threads can opt out of pushing while staying in the cooperative call, and wp.tile_stack_pop() returns a (value, slot) pair with slot == -1 for threads that did not get an element (for example, when the stack runs dry).

import warp as wp

BLOCK = 8
CAP = wp.constant(8)


@wp.kernel
def compact(data: wp.array[int], out: wp.array[int], out_count: wp.array[int]):
    _i, j = wp.tid()
    s = wp.tile_stack(capacity=CAP, dtype=int)

    val = data[j]
    wp.tile_stack_push(s, val, val > 5)

    if j == 0:
        out_count[0] = wp.tile_stack_count(s)

    result, slot = wp.tile_stack_pop(s)
    if slot != -1:
        out[slot] = result
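
A sequential model of the kernel above can help when reasoning about expected results. The sketch below is plain Python, not Warp code. The order in which threads push and the mapping from threads to popped slots are assumptions made for illustration, so only the set of compacted values and the count should be compared against a real run.

```python
def compact_reference(data, threshold=5):
    """Model one thread block: cooperative push, count, then cooperative pop."""
    stack = []
    for val in data:               # every thread joins the push; has_value = val > threshold
        if val > threshold:
            stack.append(val)
    count = len(stack)             # what wp.tile_stack_count() would report after the push

    out = [0] * len(data)
    slots = []
    for _lane in range(len(data)):  # every thread joins the pop
        if stack:
            slot = len(stack) - 1
            out[slot] = stack.pop()
        else:
            slot = -1               # the stack ran dry for this thread
        slots.append(slot)
    return out, count, slots


out, count, slots = compact_reference([3, 9, 6, 1, 8, 2, 7, 4])
print(out[:count], count, slots)
```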

Cholesky improvements

wp.tile_cholesky(), wp.tile_cholesky_inplace(), wp.tile_cholesky_solve(), and wp.tile_cholesky_solve_inplace() accept a new fill_mode parameter ("lower" or "upper") for upper Cholesky factorization and solve (#1318). The default lower path reads its input by columns, which hits a known slowdown at power-of-2 tile sizes. The new fill_mode="upper" path reads by rows and avoids that cliff, running roughly 1.2x to 1.7x faster on power-of-2 tiles in microbenchmarks on an RTX PRO 6000 Blackwell. Out-of-place tile_cholesky factorization also gains a backward pass, so it can be used as part of a differentiable solve (#1316).

import warp as wp

TILE = 64

@wp.kernel
def factor(A: wp.array2d[float]):
    a = wp.tile_load(A, shape=(TILE, TILE))
    U = wp.tile_cholesky(a, fill_mode="upper")  # upper-triangular factor
    wp.tile_store(A, U)
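
As a reminder of how the two fill modes relate, the lower and upper factors of a symmetric positive-definite matrix are transposes of each other: A = L L^T = U^T U with U = L^T. The plain-Python sketch below checks this on a small matrix. It assumes Warp follows this standard convention for fill_mode="upper", and the helper names are illustrative only.

```python
import math


def cholesky_lower(A):
    """Textbook Cholesky factorization returning lower-triangular L with A = L L^T."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def transpose(M):
    return [list(row) for row in zip(*M)]


def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]
L = cholesky_lower(A)
U = transpose(L)                       # upper factor under the assumed convention
reconstructed = matmul(transpose(U), U)  # should reproduce A up to rounding
print(reconstructed)
```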

Multi-dimensional tile FFTs

wp.tile_fft() and wp.tile_ifft() now accept tiles of rank 2 or higher instead of being limited to 2-D, computing the FFT along the last dimension and treating any leading dimensions as independent batches (#1317). A tile of shape (B1, B2, N) or (B1, B2, B3, N) no longer needs to be reshaped to a 2-D (B, N) tile before transforming. The other constraints carried over from the 2-D path are unchanged: the input must be a register-storage tile of wp.vec2f or wp.vec2d (interpreted as complex pairs), and the FFT length N (the last dimension) must be at least 2 * block_dim.
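
The batching rule can be illustrated with a plain-Python reference that uses nested lists for the tile and Python complex numbers in place of wp.vec2f pairs. This is a sketch of the semantics only (a naive DFT, not Warp's implementation), and the function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of one 1-D sequence."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def fft_last_axis(tile):
    """Transform along the innermost dimension; every leading dimension is a batch."""
    if isinstance(tile[0], list):
        return [fft_last_axis(sub) for sub in tile]
    return dft(tile)


# A (B1, B2, N) = (2, 2, 4) tile.
tile = [
    [[1, 1, 1, 1], [1, 0, 0, 0]],
    [[0, 1, 0, 0], [1, -1, 1, -1]],
]
out = fft_last_axis(tile)

# Same result as flattening the leading dimensions to a (B, N) = (4, 4) batch first.
flat_out = [dft(row) for plane in tile for row in plane]
print(out[1][1])
```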

Faster tile load and store

wp.tile_load() and wp.tile_store() get two new fast paths without any kernel changes (#1236):

Aligned 3-D and 4-D tiles reach the same fast shared-memory load/store path that previously applied only to 2-D, raising tile bandwidth on higher-rank kernels when the alignment, contiguity, and in-bounds requirements are met.
Tiles of large element types like wp.mat33, wp.mat44, or wp.mat66 get a new shared-memory copy path that replaces the previous per-element loop, restoring the bandwidth that the old path lost on multi-byte elements. The bigger the element, the bigger the speedup.

A new aligned parameter on both functions skips runtime alignment checks when the caller can guarantee 16-byte alignment, contiguity, and in-bounds access:

- wp.tile_load(arr, shape=N, offset=i * N)
+ wp.tile_load(arr, shape=N, offset=i * N, aligned=True)
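
Before opting in to aligned=True, it is worth checking that every tile start actually falls on a 16-byte boundary. The plain-Python sketch below does that for the 1-D pattern offset = i * N. It covers only the alignment part of the guarantee (contiguity and in-bounds access still have to hold), assumes the array's base address is itself 16-byte aligned unless one is passed in, and uses a hypothetical helper name.

```python
def tile_starts_are_16_byte_aligned(tile_len: int, itemsize: int, num_tiles: int,
                                    base_address: int = 0) -> bool:
    """True if base_address + i * tile_len * itemsize is a multiple of 16 for every tile i."""
    return all((base_address + i * tile_len * itemsize) % 16 == 0 for i in range(num_tiles))


# float32 elements (4 bytes): tiles of 64 elements start every 256 bytes.
ok_64 = tile_starts_are_16_byte_aligned(tile_len=64, itemsize=4, num_tiles=8)
# Tiles of 3 float32 elements start every 12 bytes, so most starts are misaligned.
ok_3 = tile_starts_are_16_byte_aligned(tile_len=3, itemsize=4, num_tiles=8)
print(ok_64, ok_3)
```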

Tile parameters by reference in `@wp.func_native`

@wp.func_native snippets now receive tile parameters by reference instead of by value, matching the behavior of @wp.func (#1362). Native snippets can therefore modify shared tiles in place rather than receiving a local copy that gets discarded on return.

`warp.fem` double-precision support

warp.fem gains end-to-end wp.float64 support (#418). Precision is selected on the geometry (e.g. scalar_type=wp.float64 on grid constructors) and propagates automatically to function spaces, quadrature, fields, and integration kernels. Existing wp.float32 setups are unchanged.

import warp as wp
from warp.fem import Grid2D

geo = Grid2D(res=(64, 64), bounds_lo=wp.vec2(0.0), bounds_hi=wp.vec2(1.0), scalar_type=wp.float64)
# spaces, quadrature, integrate(...) calls built on `geo` now use float64 throughout.

The default output_dtype of warp.fem.integrate() now follows the geometry's scalar type (wp.float32 or wp.float64) instead of the accumulate_dtype. See Breaking changes for the migration.

Compilation and tooling

Faster CPU compile times

CPU kernel compile times in multi-module workloads drop substantially via precompiled-header support, controlled by the existing warp.config.use_precompiled_headers setting (#595). Repeated module compiles in a session no longer re-parse the bundled headers on every call.

CPU kernels target host CPU features by default

CPU kernels now compile with -march=native by default, so generated code automatically picks up the wider SIMD instruction sets (AVX2, AVX-512, NEON variants) available on the build host instead of targeting a generic x86-64 / aarch64 baseline (#1308). Existing kernels do not need any change to benefit.

As a consequence, AOT-compiled CPU modules and shared CPU kernel caches are now host-specific. wp.compile_aot_module() emits a warning when it produces a CPU module under the new default, because loading that module on a host with a narrower CPU feature set would fail with an illegal-instruction crash. Set wp.config.cpu_compiler_flags = "" before compiling to opt back into a portable baseline build. Cached CPU .o filenames also include a short host-ISA hash (e.g. module.cpu1a2b3c4d.o), so heterogeneous CI runners and shared kernel caches no longer cross-load incompatible binaries.

Inline module options on `@wp.kernel`

@wp.kernel(module="unique", module_options={...}) accepts a dict of module-level compilation options inline (#1250), avoiding the previous pattern of toggling warp.config.* globally before defining the kernel. One useful application is disabling cuBLASDx for wp.tile_matmul during development to skip its slow LTO compile step:

import warp as wp

M, N, K = 16, 16, 16


@wp.kernel(module="unique", module_options={"enable_mathdx_gemm": False})
def matmul(a: wp.array2d[float], b: wp.array2d[float], c: wp.array2d[float]):
    ta = wp.tile_load(a, shape=(M, K))
    tb = wp.tile_load(b, shape=(K, N))
    tc = wp.tile_zeros(dtype=float, shape=(M, N))
    wp.tile_matmul(ta, tb, tc)
    wp.tile_store(c, tc)

On a clean kernel cache the kernel above compiles in roughly a second, instead of the tens of seconds the default cuBLASDx-backed path can take.

Other improvements

Optimization-level config now applies to the full CPU compilation pipeline (CPU defaults remain -O2, CUDA defaults remain -O3) (#1310).
Custom kernel-cache paths now include the Warp version, so artifacts from a previous version stop interfering after an upgrade (#1260).
Module hashes ignore the difference between an unset config option (None) and its explicit default, avoiding spurious recompilation (#1307).
wp.get_suggested_block_size(kernel) queries the CUDA driver's occupancy API for a launch configuration that maximizes per-SM occupancy (#1270). It returns (block_size, min_grid_size); pass block_size to wp.launch(..., block_dim=...).

Language enhancements

Short-circuit and / or (#1329): chained boolean operators in kernels now use Python semantics, so guards like if arr and arr[i] == 0 no longer crash from eagerly evaluating the right-hand side.
Typed composite constructors accept literals (#1297): wp.vec3d(), wp.mat22h(), wp.quatd(), etc. now accept Python scalar literals directly without an explicit cast, preserving precision.
wp.indexedarray fields in @wp.struct (#1327): structs may now hold wp.indexedarray fields, with assignment, device transfer, and NumPy structured-value support working the same way as wp.array fields.
wp.Volume anisotropic voxels (#1193): wp.Volume.load_from_numpy() and wp.Volume.allocate() accept a 3-element sequence for voxel_size, enabling anisotropic voxel spacing.
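
The difference that short-circuit evaluation makes can be seen in plain Python, whose semantics Warp kernels now follow for and / or. The sketch below is an analogy rather than kernel code: an empty list stands in for a null array, and the eager variant mimics evaluating the right-hand side unconditionally.

```python
def first_is_zero(arr, i=0):
    # Short-circuit: arr[i] is evaluated only when arr is truthy.
    return bool(arr and arr[i] == 0)


def first_is_zero_eager(arr, i=0):
    lhs = bool(arr)
    rhs = arr[i] == 0  # evaluated even when the guard on the left is false
    return lhs and rhs


guarded = first_is_zero([])  # safe: the right-hand side is never touched

try:
    first_is_zero_eager([])
    eager_failed = False
except IndexError:
    eager_failed = True      # the eager form indexes into an empty list

print(guarded, eager_failed)
```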

Breaking changes

`tape.backward()` zeroes intermediate gradient buffers (#1062)

tape.backward() now zeroes the gradient buffer of any array written to in the forward pass, matching PyTorch's behavior for intermediate gradients. This is a correctness fix: previously, calling tape.backward() multiple times accumulated stale .grad from the prior call into the new pass.

# v1.12 behavior: out.grad still holds the upstream seed after backward()
out = wp.zeros(N, dtype=float, requires_grad=True)
with wp.Tape() as tape:
    wp.launch(forward_kernel, dim=N, inputs=[x], outputs=[out])
out.grad = wp.ones_like(out)
tape.backward()
# v1.12: out.grad == ones (still the seed)
# v1.13: out.grad == zeros (cleared during backward)

# v1.13 migration: pass fresh upstream gradients each backward call
with wp.Tape() as tape:
    wp.launch(forward_kernel, dim=N, inputs=[x], outputs=[out])
tape.backward(grads={out: wp.ones_like(out)})

# Or opt out of the zeroing for arrays you want to inspect after backward.
# Only safe when each element is written at most once per forward pass.
out = wp.array(shape=N, dtype=float, requires_grad=True, retain_grad=True)

Backward passes of kernels with many per-element array writes (e.g. matrix component assignments) may be slower because of the additional zeroing. See the differentiability guide for the full rationale.
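
Why stale intermediate gradients are a correctness problem can be seen in a scalar toy model. The sketch below is plain Python, not Warp's tape implementation: it models the chain h = 3 * x, y = 2 * h, where h is an intermediate array, and the function name and flag are hypothetical.

```python
def backward_pass(y_grad_seed, h_grad_before, zero_intermediates):
    """Reverse pass for h = 3 * x, y = 2 * h. Returns (x_grad, h_grad left behind)."""
    h_grad = h_grad_before + 2.0 * y_grad_seed  # adjoint of y = 2 * h accumulates into h.grad
    x_grad = 3.0 * h_grad                       # adjoint of h = 3 * x
    if zero_intermediates:
        h_grad = 0.0                            # v1.13-style: clear the buffer once consumed
    return x_grad, h_grad


# Without zeroing: the second call picks up the h.grad left behind by the first.
x1_old, h_left_old = backward_pass(1.0, 0.0, zero_intermediates=False)
x2_old, _ = backward_pass(1.0, h_left_old, zero_intermediates=False)

# With zeroing: both calls produce the same gradient for x.
x1_new, h_left_new = backward_pass(1.0, 0.0, zero_intermediates=True)
x2_new, _ = backward_pass(1.0, h_left_new, zero_intermediates=True)

print(x1_old, x2_old, x1_new, x2_new)
```

In this model the true derivative dy/dx is 6. Only the zeroing variant returns it on both calls, while the non-zeroing variant doubles the result on the second call.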

`warp.fem.integrate()` `output_dtype` default change (#418)

The default output_dtype of warp.fem.integrate() now follows the geometry's scalar type instead of accumulate_dtype (which itself defaults to wp.float64). For wp.float32 geometries this changes the default output from wp.float64 to wp.float32.

- result = warp.fem.integrate(form, fields=fields)            # v1.12: float64
+ result = warp.fem.integrate(form, fields=fields, output_dtype=wp.float64)  # explicit

Announcements

Release cadence

Warp now aims to publish a feature release every month. This replaces the previous schedule that alternated monthly between feature and bugfix releases. Bugfix releases are no longer regularly scheduled. They are issued ad hoc, only when an important issue cannot wait for the next feature release, and only against the most recent feature release line. See the Compatibility & Support page for the full versioning, deprecation, and support policy.

Removals in this release

Python 3.9 is no longer supported. Python 3.10 is now the minimum required version (#1263). Update to Python 3.10 or newer to continue receiving Warp updates.
Private-API forwarding shims removed. The deprecation cycle started in v1.11 is finalized: code that imported shim module paths (warp.torch, warp.context, etc.), used warp.mat / warp.vec, referenced warp.context.Devicelike, or relied on the Module.foo -> Module._foo proxy now raises rather than warning (#1352). Use the curated public API in the warp namespace. The API documentation is the source of truth for what is public.
wp.isfinite(), wp.isnan(), and wp.isinf() no longer accept integer types. These functions now accept floating-point arguments only, finalizing the deprecation announced in v1.11 (#847). Drop integer call sites or wrap the operand in a float cast.

Upcoming removals

Texture copy_from_array() / copy_to_array() deprecated. Use the new copy_from() / copy_to() methods instead (#1238). Calling the deprecated names emits a DeprecationWarning. They will be removed in a future feature release per the standard deprecation timeline.

Acknowledgments

We thank the following contributors from outside the core Warp development team:

@FabienPean-Virtonomy for adding wp.indexedarray field support in @wp.struct, with assignment, device transfer, and NumPy structured-value handling (#1327).
@Glitchfix for adding anisotropic voxel-size support to wp.Volume.load_from_numpy() and wp.Volume.allocate(), including the shared _normalize_voxel_size() validation helper and tests (#1193).

For a complete list of changes, see the full changelog.
