[REQ] Add APIC: CUDA graph capture, serialization, and replay #1214

### Description

Add support for capturing Warp computation graphs (kernel launches and memory operations), serializing them to a portable binary format, and replaying them later -- including from standalone C++ applications without the Python runtime.

This feature, called **APIC (API Capture)**, extends Warp's existing CUDA graph infrastructure with three new capabilities:

1. **Capture** -- Record kernel launches, memory copies, and memsets during `wp.ScopedCapture` with full parameter and memory-region metadata.
2. **Serialize** -- Save the captured graph to a `.wgf` (Warp Graph File) binary format alongside companion `.cubin` module files.
3. **Replay** -- Load and execute a serialized graph on a compatible GPU, with named input/output bindings for supplying new data without rebuilding the graph.

#### Python API

```python
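import warp as wp

# Illustrative setup (not part of the proposed API): a kernel and its arrays.
@wp.kernel
def my_kernel(positions: wp.array(dtype=wp.vec3), results: wp.array(dtype=float)):
    tid = wp.tid()
    results[tid] = wp.length(positions[tid])

n = 1024
positions = wp.zeros(n, dtype=wp.vec3, device="cuda:0")
results = wp.zeros(n, dtype=float, device="cuda:0")
new_positions = wp.ones(n, dtype=wp.vec3, device="cuda:0")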

# 1. Capture with APIC enabled (default)
with wp.ScopedCapture(apic=True) as capture:
    wp.launch(my_kernel, dim=n, inputs=[positions], outputs=[results])

# 2. Save with named parameter bindings
wp.capture_save(capture.graph, "my_computation",
                inputs={"positions": positions},
                outputs={"results": results})

# 3. Load and execute later (no original Python program needed)
loaded = wp.capture_load("my_computation")
loaded.set_param("positions", new_positions)
wp.capture_launch(loaded)
```

New public Python APIs:
- `wp.capture_save(graph, path, inputs=None, outputs=None)` -- serialize a captured graph
- `wp.capture_load(path, device=None)` -- load a serialized graph
- `wp.handle` type -- subclass of `uint64` for automatic pointer-remapping detection

#### C API (for native embedding)

```c
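/* Load the serialized graph into an existing CUDA context. */
APICGraph graph = wp_apic_load_graph(cuda_context, "my_computation.wgf");
/* Supply new data for the named input binding. */
wp_apic_set_param(graph, "positions", host_data, size);
/* Launch the underlying CUDA graph on a stream, then release it. */
cudaGraphLaunch(wp_apic_get_cuda_graph_exec(graph), stream);
wp_apic_destroy_graph(graph);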
```

Additional C API functions: `wp_apic_get_param_ptr()`, `wp_apic_get_num_params()`, `wp_apic_get_param_name()`, `wp_apic_get_param_size()`.

### Context

Several use cases motivate this feature:

- **Deployment without Python** -- Simulation or inference pipelines authored in Warp can be exported and loaded by a lightweight C++ runtime, removing the Python dependency at deployment time.
- **Cross-process replay** -- A captured graph can be saved by one process (or machine) and replayed by another, as long as the GPU architecture matches.
- **Caching and reproducibility** -- Serialized graphs capture the exact sequence of operations and compiled kernels, enabling deterministic replay.
- **Native application integration** -- Game engines, robotics stacks, and other C++ applications can embed Warp computations via the C API without linking against the Python interpreter.

### Design Overview

Key design decisions:

| Decision | Rationale |
|----------|-----------|
| Custom binary format (`.wgf`) with packed C structs | Compact, no JSON dependency in C++, version-tolerant via section table |
| Separate `.cubin` files per module in a `_modules/` directory | Matches Warp's one-module-many-kernels compilation model; standard CUDA binary format |
| Type-agnostic byte-size approach for array regions | Avoids complex type reflection; works with arbitrary `vec`/`mat`/struct types |
| Graph reconstruction via capture-replay pattern | Simpler than tracking individual graph nodes; uses standard CUDA capture APIs |
| `wp.handle` type for automatic pointer remapping | Allows APIC to detect which parameters/struct fields need fixup when objects like `wp.Mesh` are recreated on load |
| Memcpy-based parameter updates on the C++ path | No graph rebuild needed for changing input/output data; efficient for frequent updates |
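
The `wp.handle` row can be illustrated with a small, Warp-free sketch. Everything in it is hypothetical: `ParamField`, `remap_handles`, and the 8-byte little-endian handle layout are assumptions made for illustration, not the APIC implementation. The point is only that fields tagged as handles are the ones rewritten when objects are recreated at new addresses on load.

```python
import struct
from dataclasses import dataclass


@dataclass
class ParamField:
    """Hypothetical description of one field inside a packed argument blob."""
    offset: int      # byte offset of the field within the blob
    size: int        # byte size of the field
    is_handle: bool  # True if the field was declared with a handle type


def remap_handles(blob, fields, old_to_new):
    """Return a copy of blob with every handle field rewritten via old_to_new.

    Non-handle fields are left untouched, even if their bytes happen to
    equal an old handle value.
    """
    out = bytearray(blob)
    for f in fields:
        if not f.is_handle:
            continue
        (old,) = struct.unpack_from("<Q", out, f.offset)
        if old in old_to_new:
            struct.pack_into("<Q", out, f.offset, old_to_new[old])
    return bytes(out)


# A plain uint64 followed by a handle; both hold the same numeric value.
blob = struct.pack("<QQ", 0x1000, 0x1000)
fields = [ParamField(0, 8, is_handle=False), ParamField(8, 8, is_handle=True)]

# Pretend the object behind 0x1000 was recreated at 0x2000 on load.
patched = remap_handles(blob, fields, {0x1000: 0x2000})
plain_value, handle_value = struct.unpack("<QQ", patched)
```

Tagging by type is what makes the rewrite safe here: a plain integer that coincides with an old address is not touched.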

#### File Format

```
my_graph.wgf                        # Binary: header + metadata + memory + operations
my_graph_modules/                    # One .cubin per Warp module
    simulation_abc12345.cubin
    rendering_def67890.cubin
```

The `.wgf` file starts with a header carrying the magic `WGF1`, followed by sections for metadata, memory regions, and operations. The memory-region section records base allocations and resolves array aliasing (slices that share the same underlying memory).
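
As a rough sketch of how a section table could give version tolerance, the snippet below writes and reads a toy file. Only the `WGF1` magic and the three section kinds come from this issue; the field widths, the numeric section ids, and the helper names `build_file` and `read_sections` are assumptions and may not match the real layout.

```python
import struct

MAGIC = b"WGF1"                  # from the issue text
HEADER = struct.Struct("<4sII")  # assumed: magic, version, section count
SECTION = struct.Struct("<IQQ")  # assumed: section id, byte offset, byte size

# Hypothetical section ids; the real values are not given in the issue.
SECTION_NAMES = {1: "metadata", 2: "memory_regions", 3: "operations"}


def build_file(sections, version=1):
    """Pack {section id: payload bytes} into header + section table + payloads."""
    offset = HEADER.size + SECTION.size * len(sections)
    table, payload = b"", b""
    for sid, data in sections.items():
        table += SECTION.pack(sid, offset, len(data))
        payload += data
        offset += len(data)
    return HEADER.pack(MAGIC, version, len(sections)) + table + payload


def read_sections(buf):
    """Return {section name: payload} for the sections this reader knows."""
    magic, _version, count = HEADER.unpack_from(buf, 0)
    if magic != MAGIC:
        raise ValueError("not a WGF file")
    out = {}
    for i in range(count):
        sid, off, size = SECTION.unpack_from(buf, HEADER.size + i * SECTION.size)
        name = SECTION_NAMES.get(sid)
        if name is None:
            continue  # unknown section from a newer writer: skip it
        out[name] = buf[off:off + size]
    return out


# Section id 99 stands in for something a newer writer might add.
demo = build_file({1: b"meta", 2: b"mem", 99: b"future", 3: b"ops"})
parsed = read_sections(demo)
```

Because each entry carries its own offset and size, a reader can ignore sections it does not recognize without knowing their contents.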

### Scope

#### Included in this feature (Phases 1-4 + Mesh)

- [x] Capture kernel launches during CUDA graph recording
- [x] Capture memory operations (memcpy, memset, allocations)
- [x] Serialize captured graph to `.wgf` binary format
- [x] Deserialize and recreate graph from `.wgf`
- [x] Execute deserialized graph via `wp.capture_launch()`
- [x] Serialize `wp.array` memory with aliasing/slicing support
- [x] Serialize compiled CUDA kernels (CUBIN as separate files)
- [x] Input/output bindings with named parameters
- [x] `wp.Mesh` serialization and handle remapping
- [x] Array slicing/aliasing (same underlying memory)
- [x] C++ loading API (`wp_apic_load_graph`, `wp_apic_set_param`, etc.)
- [x] Python and C++ test coverage (19+ tests)
- [x] C++ example (`02_apic_visualization/`)

#### Files changed (~32 files, ~9,000 lines)

| Area | Files |
|------|-------|
| Core Python | `warp/_src/apic/__init__.py`, `capture.py`, `serialize.py` |
| Native C++/CUDA | `warp/native/apic.h`, `apic_types.h`, `apic.cu` |
| Modified existing | `context.py`, `types.py`, `utils.py`, `warp.cu`, `warp.h`, `warp.cpp`, `mesh.cpp` |
| Tests | `warp/tests/cuda/test_apic.py` (19 tests), `test_apic_mesh.py` |

#### Deferred / future work

- [ ] `wp.capture_func()` convenience API
- [ ] Standalone C++ header/source generation for embedding without Warp runtime
- [ ] `wp.Volume` and `wp.BVH` serialization
- [ ] Conditional graphs (`wp.capture_if()` / `wp.capture_while()`)
- [ ] Multi-GPU graph support
- [ ] Cross-architecture portability (store PTX alongside CUBIN)
- [ ] Graph visualization / debugging tools

### Acceptance Criteria

1. `wp.capture_save()` produces a valid `.wgf` file and companion `_modules/` directory from any `wp.ScopedCapture(apic=True)` graph.
2. `wp.capture_load()` loads the `.wgf` file and produces a `Graph` that executes correctly via `wp.capture_launch()`, matching the original computation's results.
3. Named input/output bindings allow supplying new array data to a loaded graph.
4. `wp.Mesh` objects referenced by captured graphs are automatically serialized and recreated on load, with handle pointers remapped transparently.
5. Array aliasing (slices sharing memory) is handled correctly -- base allocations are serialized once and views are reconstructed with proper offsets.
6. The C API works from standalone C++ code linked against the Warp native library.
7. All new code has test coverage; existing `wp.capture_*` APIs remain backward-compatible.
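
Criterion 5 can be pictured with plain integers standing in for device addresses. This is a sketch under stated assumptions, not the APIC data model: `Region`, `ViewRef`, `record_views`, and `rebuild_views` are hypothetical names, and real captures would obtain addresses and sizes from `wp.array` objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A base allocation, identified by start address and byte size."""
    ptr: int
    nbytes: int


@dataclass(frozen=True)
class ViewRef:
    """A view stored as (index of base region, byte offset, byte size)."""
    region: int
    offset: int
    nbytes: int


def record_views(regions, views):
    """Map each (ptr, nbytes) view onto the base region that contains it."""
    refs = []
    for ptr, nbytes in views:
        for i, r in enumerate(regions):
            if r.ptr <= ptr and ptr + nbytes <= r.ptr + r.nbytes:
                refs.append(ViewRef(i, ptr - r.ptr, nbytes))
                break
        else:
            raise ValueError(f"view at {ptr:#x} is not inside any known region")
    return refs


def rebuild_views(new_bases, refs):
    """Recompute view addresses after the base regions were reallocated."""
    return [(new_bases[ref.region] + ref.offset, ref.nbytes) for ref in refs]


# Two slices of one 1024-byte allocation at capture time.
regions = [Region(ptr=0x1000, nbytes=1024)]
refs = record_views(regions, [(0x1000, 512), (0x1200, 512)])

# On load the allocation lands at a different address.
views = rebuild_views([0x8000], refs)
```

Only the single base region would need its bytes written to disk; both slices are stored as offsets into it and come back pointing into the new allocation.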
