[REQ] Add APIC: CUDA graph capture, serialization, and replay #1214

@shi-eric

Description

Add support for capturing Warp computation graphs (kernel launches and memory operations), serializing them to a portable binary format, and replaying them later -- including from standalone C++ applications without the Python runtime.

This feature, called APIC (API Capture), extends Warp's existing CUDA graph infrastructure with three new capabilities:

  1. Capture -- Record kernel launches, memory copies, and memsets issued inside a wp.ScopedCapture block, with full parameter and memory-region metadata.
  2. Serialize -- Save the captured graph to a .wgf (Warp Graph File) binary format alongside companion .cubin module files.
  3. Replay -- Load and execute a serialized graph on a compatible GPU, with named input/output bindings for supplying new data without rebuilding the graph.

Python API

import warp as wp

# 1. Capture with APIC enabled (default)
with wp.ScopedCapture(apic=True) as capture:
    wp.launch(my_kernel, dim=n, inputs=[positions], outputs=[results])

# 2. Save with named parameter bindings
wp.capture_save(capture.graph, "my_computation",
                inputs={"positions": positions},
                outputs={"results": results})

# 3. Load and execute later (no original Python program needed)
loaded = wp.capture_load("my_computation")
loaded.set_param("positions", new_positions)
wp.capture_launch(loaded)

New public Python APIs:

  • wp.capture_save(graph, path, inputs=None, outputs=None) -- serialize a captured graph
  • wp.capture_load(path, device=None) -- load a serialized graph
  • wp.handle type -- subclass of uint64 for automatic pointer-remapping detection
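To illustrate why a dedicated handle type helps with remapping, here is a minimal sketch in plain Python. It is not Warp's implementation: `Handle`, `MeshQueryArgs`, and `remap_handles` are hypothetical names, and a Python `int` subclass stands in for the uint64-based wp.handle. The point is only that a distinct type lets a serializer tell pointer-like fields apart from ordinary integers that happen to hold the same value.

```python
from dataclasses import dataclass, fields, replace


class Handle(int):
    """Hypothetical stand-in for wp.handle: an integer tagged as a pointer-like ID."""


@dataclass(frozen=True)
class MeshQueryArgs:
    # Hypothetical kernel-argument struct mixing a handle with plain integers.
    mesh_id: Handle
    max_iterations: int
    seed: int


def remap_handles(args, old_to_new):
    """Return a copy of args with every Handle-typed field replaced via old_to_new."""
    updates = {}
    for f in fields(args):
        value = getattr(args, f.name)
        # Only values typed as Handle are treated as pointers. Plain ints are
        # left alone, even if they happen to equal an old handle value.
        if isinstance(value, Handle):
            updates[f.name] = Handle(old_to_new[int(value)])
    return replace(args, **updates)


# The seed deliberately collides with the old handle value to show it is not touched.
captured = MeshQueryArgs(mesh_id=Handle(0x7F00_1000), max_iterations=32, seed=0x7F00_1000)
loaded = remap_handles(captured, {0x7F00_1000: 0x7F55_2000})
```

With raw uint64 fields there would be no way to distinguish `mesh_id` from `seed` in this example, which is the ambiguity the typed handle is meant to remove.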

C API (for native embedding)

APICGraph graph = wp_apic_load_graph(cuda_context, "my_computation.wgf");
wp_apic_set_param(graph, "positions", host_data, size);
cudaGraphLaunch(wp_apic_get_cuda_graph_exec(graph), stream);
wp_apic_destroy_graph(graph);

Additional C API functions: wp_apic_get_param_ptr(), wp_apic_get_num_params(), wp_apic_get_param_name(), wp_apic_get_param_size().

Context

Several use cases motivate this feature:

  • Deployment without Python -- Simulation or inference pipelines authored in Warp can be exported and loaded by a lightweight C++ runtime, removing the Python dependency at deployment time.
  • Cross-process replay -- A captured graph can be saved by one process (or machine) and replayed by another, as long as the GPU architecture matches.
  • Caching and reproducibility -- Serialized graphs capture the exact sequence of operations and compiled kernels, enabling deterministic replay.
  • Native application integration -- Game engines, robotics stacks, and other C++ applications can embed Warp computations via the C API without linking against the Python interpreter.

Design Overview

Key design decisions:

| Decision | Rationale |
| --- | --- |
| Custom binary format (.wgf) with packed C structs | Compact, no JSON dependency in C++, version-tolerant via section table |
| Separate .cubin files per module in a _modules/ directory | Matches Warp's one-module-many-kernels compilation model; standard CUDA binary format |
| Type-agnostic byte-size approach for array regions | Avoids complex type reflection; works with arbitrary vec/mat/struct types |
| Graph reconstruction via capture-replay pattern | Simpler than tracking individual graph nodes; uses standard CUDA capture APIs |
| wp.handle type for automatic pointer remapping | Allows APIC to detect which parameters/struct fields need fixup when objects like wp.Mesh are recreated on load |
| Memcpy-based parameter updates on the C++ path | No graph rebuild needed for changing input/output data; efficient for frequent updates |
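The capture-replay and memcpy-based update decisions can be pictured with a small CPU-only toy. This is a sketch under stated assumptions, not the APIC design itself: `ToyGraph`, `bind`, `record`, and `double_bytes` are hypothetical, and bytearrays stand in for device allocations. It shows why copying new data into an already-bound buffer avoids rebuilding the recorded operations.

```python
class ToyGraph:
    """Hypothetical recorder: stores operations as (function, buffers) pairs for replay."""

    def __init__(self):
        self.ops = []
        self.params = {}

    def bind(self, name, buffer):
        # Register a named buffer; recorded operations reference this exact object.
        self.params[name] = buffer

    def record(self, fn, *buffers):
        self.ops.append((fn, buffers))

    def set_param(self, name, data):
        # Copy into the existing buffer in place (the memcpy-style update).
        # Recorded operations still point at the same memory, so nothing is rebuilt.
        target = self.params[name]
        if len(data) != len(target):
            raise ValueError(f"{name}: expected {len(target)} bytes, got {len(data)}")
        target[:] = data

    def launch(self):
        for fn, buffers in self.ops:
            fn(*buffers)


def double_bytes(src, dst):
    # Stand-in for a kernel: dst[i] = 2 * src[i] (mod 256).
    for i, value in enumerate(src):
        dst[i] = (2 * value) % 256


positions = bytearray([1, 2, 3])
results = bytearray(3)

graph = ToyGraph()
graph.bind("positions", positions)
graph.bind("results", results)
graph.record(double_bytes, positions, results)

graph.launch()
first = bytes(results)

graph.set_param("positions", bytes([10, 20, 30]))  # new input, same recorded graph
graph.launch()
second = bytes(results)
```

The size check in `set_param` mirrors the byte-size view of array regions: the toy only needs to know how many bytes a parameter occupies, not its element type.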

File Format

my_graph.wgf                        # Binary: header + metadata + memory + operations
my_graph_modules/                    # One .cubin per Warp module
    simulation_abc12345.cubin
    rendering_def67890.cubin

The .wgf header uses magic WGF1, with sections for metadata, memory regions, and operations. Memory regions track base allocations and handle array aliasing (slices sharing the same underlying memory).
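As a rough picture of how a section table can make a packed binary format version-tolerant, the sketch below writes and reads a tiny file image with Python's `struct` module. Only the `WGF1` magic comes from this issue. The field layout, the section IDs, and the helper names `write_file` and `read_file` are assumptions made for illustration and will not match the real .wgf layout.

```python
import struct

MAGIC = b"WGF1"                    # from the issue; everything else below is assumed
HEADER = struct.Struct("<4sII")    # magic, format version, section count (no padding)
SECTION = struct.Struct("<IQQ")    # section id, byte offset, byte size (no padding)
KNOWN = {1: "metadata", 2: "memory", 3: "operations"}  # hypothetical section ids


def write_file(sections):
    """Pack {section_id: payload_bytes} into header + section table + payloads."""
    offset = HEADER.size + SECTION.size * len(sections)
    table, body = b"", b""
    for sid, payload in sections.items():
        table += SECTION.pack(sid, offset, len(payload))
        body += payload
        offset += len(payload)
    return HEADER.pack(MAGIC, 1, len(sections)) + table + body


def read_file(blob):
    """Return {section_name: payload} for known sections, skipping unknown ids."""
    magic, _version, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("not a WGF1 file")
    out = {}
    for i in range(count):
        sid, offset, size = SECTION.unpack_from(blob, HEADER.size + i * SECTION.size)
        if sid in KNOWN:  # a newer writer may add sections; this reader ignores them
            out[KNOWN[sid]] = blob[offset:offset + size]
    return out


# Section 99 plays the role of a section added by a future format revision.
blob = write_file({1: b"meta", 2: b"\x00" * 8, 3: b"ops", 99: b"future"})
parsed = read_file(blob)
```

Because every section is located through its table entry, a reader can skip sections it does not recognize instead of failing on them.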

Scope

Included in this feature (Phases 1-4 + Mesh)

  • Capture kernel launches during CUDA graph recording
  • Capture memory operations (memcpy, memset, allocations)
  • Serialize captured graph to .wgf binary format
  • Deserialize and recreate graph from .wgf
  • Execute deserialized graph via wp.capture_launch()
  • Serialize wp.array memory with aliasing/slicing support
  • Serialize compiled CUDA kernels (CUBIN as separate files)
  • Input/output bindings with named parameters
  • wp.Mesh serialization and handle remapping
  • Array slicing/aliasing (same underlying memory)
  • C++ loading API (wp_apic_load_graph, wp_apic_set_param, etc.)
  • Python and C++ test coverage (19+ tests)
  • C++ example (02_apic_visualization/)

Files changed (~32 files, ~9,000 lines)

| Area | Files |
| --- | --- |
| Core Python | warp/_src/apic/__init__.py, capture.py, serialize.py |
| Native C++/CUDA | warp/native/apic.h, apic_types.h, apic.cu |
| Modified existing | context.py, types.py, utils.py, warp.cu, warp.h, warp.cpp, mesh.cpp |
| Tests | warp/tests/cuda/test_apic.py (19 tests), test_apic_mesh.py |

Deferred / future work

  • wp.capture_func() convenience API
  • Standalone C++ header/source generation for embedding without Warp runtime
  • wp.Volume and wp.BVH serialization
  • Conditional graphs (wp.capture_if() / wp.capture_while())
  • Multi-GPU graph support
  • Cross-architecture portability (store PTX alongside CUBIN)
  • Graph visualization / debugging tools

Acceptance Criteria

  1. wp.capture_save() produces a valid .wgf file and companion _modules/ directory from any wp.ScopedCapture(apic=True) graph.
  2. wp.capture_load() loads the .wgf file and produces a Graph that executes correctly via wp.capture_launch(), matching the original computation's results.
  3. Named input/output bindings allow supplying new array data to a loaded graph.
  4. wp.Mesh objects referenced by captured graphs are automatically serialized and recreated on load, with handle pointers remapped transparently.
  5. Array aliasing (slices sharing memory) is handled correctly -- base allocations are serialized once and views are reconstructed with proper offsets.
  6. The C API works from standalone C++ code linked against the Warp native library.
  7. All new code has test coverage; existing wp.capture_* APIs remain backward-compatible.
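Criterion 5 can be illustrated with a CPU-only sketch that stores each base allocation once and rebuilds views from byte offsets. The names `View`, `serialize`, and `deserialize` are hypothetical, a dict stands in for the on-disk format, and bytearrays with `memoryview` slices stand in for device memory. It is one possible way to model aliasing, not a description of how APIC does it.

```python
from dataclasses import dataclass


@dataclass
class View:
    # Hypothetical record: a view is a byte range inside a base allocation.
    region: int   # index of the base allocation
    offset: int   # byte offset into that allocation
    nbytes: int   # length of the view in bytes


def serialize(bases, views):
    """Store each base allocation once; views carry only (region, offset, nbytes)."""
    return {
        "regions": [bytes(b) for b in bases],
        "views": {name: (v.region, v.offset, v.nbytes) for name, v in views.items()},
    }


def deserialize(saved):
    """Recreate base allocations, then rebuild each view as a slice of its base."""
    regions = [bytearray(r) for r in saved["regions"]]
    views = {}
    for name, (region, offset, nbytes) in saved["views"].items():
        # A memoryview slice shares storage with the base, so aliasing is preserved.
        views[name] = memoryview(regions[region])[offset:offset + nbytes]
    return regions, views


base = bytearray(range(16))
saved = serialize([base], {"full": View(0, 0, 16), "tail": View(0, 8, 8)})
regions, views = deserialize(saved)

views["tail"][0] = 255  # write through the slice; the full view sees the change
```

A check along these lines (write through one view, read through another) is one way a test could confirm that aliasing survived a save/load round trip.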

Labels: feature request
