Description
Add support for capturing Warp computation graphs (kernel launches and memory operations), serializing them to a portable binary format, and replaying them later -- including from standalone C++ applications without the Python runtime.
This feature, called APIC (API Capture), extends Warp's existing CUDA graph infrastructure with three new capabilities:
- Capture -- Record kernel launches, memory copies, and memsets during `wp.ScopedCapture` with full parameter and memory-region metadata.
- Serialize -- Save the captured graph to a `.wgf` (Warp Graph File) binary format alongside companion `.cubin` module files.
- Replay -- Load and execute a serialized graph on a compatible GPU, with named input/output bindings for supplying new data without rebuilding the graph.
Python API
```python
import warp as wp

# 1. Capture with APIC enabled (default)
with wp.ScopedCapture(apic=True) as capture:
    wp.launch(my_kernel, dim=n, inputs=[positions], outputs=[results])

# 2. Save with named parameter bindings
wp.capture_save(capture.graph, "my_computation",
                inputs={"positions": positions},
                outputs={"results": results})

# 3. Load and execute later (no original Python program needed)
loaded = wp.capture_load("my_computation")
loaded.set_param("positions", new_positions)
wp.capture_launch(loaded)
```

New public Python APIs:

- `wp.capture_save(graph, path, inputs=None, outputs=None)` -- serialize a captured graph
- `wp.capture_load(path, device=None)` -- load a serialized graph
- `wp.handle` type -- subclass of `uint64` for automatic pointer-remapping detection
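
To illustrate why a dedicated handle type helps with pointer-remapping detection, here is a minimal pure-Python sketch. It is not Warp's implementation: `Handle`, `remap_handles`, and the addresses are hypothetical names and values chosen for illustration, and a plain `int` subclass stands in for the `uint64` subclass described above.

```python
class Handle(int):
    """Hypothetical marker type: an integer known to hold a pointer/handle."""


def remap_handles(params, old_to_new):
    """Replace every Handle-typed value using an old -> new address table."""
    remapped = {}
    for name, value in params.items():
        if isinstance(value, Handle):
            # Only values tagged as handles are patched.
            remapped[name] = Handle(old_to_new[int(value)])
        else:
            # Plain integers are left untouched, even if they collide
            # numerically with an old address.
            remapped[name] = value
    return remapped


# "count" happens to hold the same number as the old mesh address.
params = {"mesh": Handle(0x7F001000), "count": 0x7F001000}
new_params = remap_handles(params, {0x7F001000: 0x7F002000})
```

Because the decision is made on the type rather than the value, the integer `count` survives unchanged while `mesh` picks up the address of the recreated object.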
C API (for native embedding)
```c
APICGraph graph = wp_apic_load_graph(cuda_context, "my_computation.wgf");
wp_apic_set_param(graph, "positions", host_data, size);
cudaGraphLaunch(wp_apic_get_cuda_graph_exec(graph), stream);
wp_apic_destroy_graph(graph);
```

Additional C API functions: `wp_apic_get_param_ptr()`, `wp_apic_get_num_params()`, `wp_apic_get_param_name()`, `wp_apic_get_param_size()`.
Context
Several use cases motivate this feature:
- Deployment without Python -- Simulation or inference pipelines authored in Warp can be exported and loaded by a lightweight C++ runtime, removing the Python dependency at deployment time.
- Cross-process replay -- A captured graph can be saved by one process (or machine) and replayed by another, as long as the GPU architecture matches.
- Caching and reproducibility -- Serialized graphs capture the exact sequence of operations and compiled kernels, enabling deterministic replay.
- Native application integration -- Game engines, robotics stacks, and other C++ applications can embed Warp computations via the C API without linking against the Python interpreter.
Design Overview
Key design decisions:
| Decision | Rationale |
|---|---|
| Custom binary format (`.wgf`) with packed C structs | Compact, no JSON dependency in C++, version-tolerant via section table |
| Separate `.cubin` files per module in a `_modules/` directory | Matches Warp's one-module-many-kernels compilation model; standard CUDA binary format |
| Type-agnostic byte-size approach for array regions | Avoids complex type reflection; works with arbitrary vec/mat/struct types |
| Graph reconstruction via capture-replay pattern | Simpler than tracking individual graph nodes; uses standard CUDA capture APIs |
| `wp.handle` type for automatic pointer remapping | Allows APIC to detect which parameters/struct fields need fixup when objects like `wp.Mesh` are recreated on load |
| Memcpy-based parameter updates on the C++ path | No graph rebuild needed for changing input/output data; efficient for frequent updates |
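
The byte-size and memcpy-update decisions can be pictured with a toy model in plain Python. This is only a sketch under stated assumptions: `ReplayBuffers` and its methods are hypothetical, a `bytearray` stands in for a device allocation whose address is baked into the graph, and nothing here calls Warp or CUDA.

```python
import struct


class ReplayBuffers:
    """Toy stand-in for the fixed memory regions of a loaded graph."""

    def __init__(self):
        self._regions = {}  # parameter name -> fixed-size bytearray

    def bind(self, name, nbytes):
        # A region is described only by its byte size, never by a dtype.
        self._regions[name] = bytearray(nbytes)

    def set_param(self, name, data):
        region = self._regions[name]
        if len(data) != len(region):
            raise ValueError(f"{name}: expected {len(region)} bytes, got {len(data)}")
        # In-place copy: the buffer object (standing in for a device address
        # baked into the graph) is reused, so nothing has to be rebuilt.
        region[:] = data

    def region(self, name):
        return self._regions[name]


buffers = ReplayBuffers()
buffers.bind("positions", 2 * 3 * 4)  # two vec3 of float32 = 24 bytes
before = buffers.region("positions")
buffers.set_param("positions", struct.pack("<6f", 0.0, 1.0, 2.0, 3.0, 4.0, 5.0))
after = buffers.region("positions")
```

The same buffer object is used before and after the update, which mirrors the idea that new input data is copied into an existing region rather than triggering a graph rebuild. The size check is the only validation possible when the region carries no type information.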
File Format
```
my_graph.wgf                      # Binary: header + metadata + memory + operations
my_graph_modules/                 # One .cubin per Warp module
    simulation_abc12345.cubin
    rendering_def67890.cubin
```
The .wgf header uses magic WGF1, with sections for metadata, memory regions, and operations. Memory regions track base allocations and handle array aliasing (slices sharing the same underlying memory).
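
As a rough illustration of how a section table can make a format version-tolerant, the sketch below packs and parses a small file with Python's `struct` module. Only the `WGF1` magic and the three section names come from the description above; the field widths, section ids, and the helper names `build_file` and `read_sections` are assumptions made for this example, not the actual `.wgf` layout.

```python
import struct

# Hypothetical layout, NOT the real .wgf specification:
#   header:        4-byte magic b"WGF1", uint32 version, uint32 section_count
#   section entry: uint32 section_id, uint64 offset, uint64 size
HEADER = struct.Struct("<4sII")
SECTION_ENTRY = struct.Struct("<IQQ")

# Hypothetical ids for the three sections the description names.
KNOWN_SECTIONS = {1: "metadata", 2: "memory_regions", 3: "operations"}


def build_file(sections, version=1):
    """Pack {section_id: payload_bytes} into header + table + payloads."""
    offset = HEADER.size + SECTION_ENTRY.size * len(sections)
    table, body = b"", b""
    for section_id, payload in sections.items():
        table += SECTION_ENTRY.pack(section_id, offset, len(payload))
        body += payload
        offset += len(payload)
    return HEADER.pack(b"WGF1", version, len(sections)) + table + body


def read_sections(blob):
    """Return {name: payload} for known sections, skipping unknown ids."""
    magic, version, count = HEADER.unpack_from(blob, 0)
    if magic != b"WGF1":
        raise ValueError("not a WGF file")
    found = {}
    for i in range(count):
        entry_offset = HEADER.size + i * SECTION_ENTRY.size
        section_id, offset, size = SECTION_ENTRY.unpack_from(blob, entry_offset)
        name = KNOWN_SECTIONS.get(section_id)
        if name is not None:  # unknown sections are ignored, not an error
            found[name] = blob[offset:offset + size]
    return found


# Section 99 plays the role of a section added by a newer writer.
blob = build_file({1: b"meta", 2: b"regions", 3: b"ops", 99: b"future"})
sections = read_sections(blob)
```

Because every section is located through the table rather than at a fixed position, a reader of this kind can skip sections it does not recognize and still find the ones it needs.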
Scope
Included in this feature (Phases 1-4 + Mesh)
- Capture kernel launches during CUDA graph recording
- Capture memory operations (memcpy, memset, allocations)
- Serialize captured graph to `.wgf` binary format
- Deserialize and recreate graph from `.wgf`
- Execute deserialized graph via `wp.capture_launch()`
- Serialize `wp.array` memory with aliasing/slicing support
- Serialize compiled CUDA kernels (CUBIN as separate files)
- Input/output bindings with named parameters
- `wp.Mesh` serialization and handle remapping
- Array slicing/aliasing (same underlying memory)
- C++ loading API (`wp_apic_load_graph`, `wp_apic_set_param`, etc.)
- Python and C++ test coverage (19+ tests)
- C++ example (`02_apic_visualization/`)
Files changed (~32 files, ~9,000 lines)
| Area | Files |
|---|---|
| Core Python | warp/_src/apic/__init__.py, capture.py, serialize.py |
| Native C++/CUDA | warp/native/apic.h, apic_types.h, apic.cu |
| Modified existing | context.py, types.py, utils.py, warp.cu, warp.h, warp.cpp, mesh.cpp |
| Tests | warp/tests/cuda/test_apic.py (19 tests), test_apic_mesh.py |
Deferred / future work
- `wp.capture_func()` convenience API
- Standalone C++ header/source generation for embedding without Warp runtime
- `wp.Volume` and `wp.BVH` serialization
- Conditional graphs (`wp.capture_if()` / `wp.capture_while()`)
- Multi-GPU graph support
- Cross-architecture portability (store PTX alongside CUBIN)
- Graph visualization / debugging tools
Acceptance Criteria
- `wp.capture_save()` produces a valid `.wgf` file and companion `_modules/` directory from any `wp.ScopedCapture(apic=True)` graph.
- `wp.capture_load()` loads the `.wgf` file and produces a `Graph` that executes correctly via `wp.capture_launch()`, matching the original computation's results.
- Named input/output bindings allow supplying new array data to a loaded graph.
- `wp.Mesh` objects referenced by captured graphs are automatically serialized and recreated on load, with handle pointers remapped transparently.
- Array aliasing (slices sharing memory) is handled correctly -- base allocations are serialized once and views are reconstructed with proper offsets.
- The C API works from standalone C++ code linked against the Warp native library.
- All new code has test coverage; existing `wp.capture_*` APIs remain backward-compatible.
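
The aliasing criterion (base allocations stored once, views rebuilt from offsets) can be sketched in plain Python. The `View` record and the `serialize` / `deserialize` helpers are hypothetical names for this example; `bytearray` and `memoryview` stand in for device allocations and array views, and the real feature may track regions differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    """An array as seen by a kernel: a byte range inside some allocation."""
    name: str
    base_id: int  # identifies the underlying allocation
    offset: int   # byte offset of the view within that allocation
    nbytes: int


def serialize(views, allocations):
    """Store each referenced allocation once, plus one small record per view."""
    used = {v.base_id for v in views}
    return {
        "bases": {base_id: bytes(allocations[base_id]) for base_id in sorted(used)},
        "views": [(v.name, v.base_id, v.offset, v.nbytes) for v in views],
    }


def deserialize(saved):
    """Recreate allocations, then rebuild every view as a window into its base."""
    bases = {base_id: bytearray(data) for base_id, data in saved["bases"].items()}
    views = {
        name: memoryview(bases[base_id])[offset:offset + nbytes]
        for name, base_id, offset, nbytes in saved["views"]
    }
    return bases, views


allocations = {0: bytearray(range(16))}
views = [View("full", 0, 0, 16), View("tail", 0, 8, 8)]
saved = serialize(views, allocations)
bases, loaded = deserialize(saved)

# Writing through one view is visible through the other: aliasing is preserved.
loaded["tail"][0] = 255
```

Two views over the same allocation produce a single stored base, and after loading, a write through `tail` shows up at byte 8 of `full`, which is the behavior the criterion asks for.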