Support async loading in wp.tile_load() #1330
Description
Add an async_load parameter to wp.tile_load() that uses CUDA's cp.async DMA pipeline to overlap global→shared memory tile loads with compute on previously loaded tiles. This enables double-buffering patterns where the hardware DMA engine loads the next tile while threads process the current one.
API:
```python
# Async load — DMA runs in background, sync is automatic on first read
b = wp.tile_load(arr, shape=(1, TILE_N), offset=(i, 0), storage="shared", async_load=True)
```

- `async_load=True` requires `storage="shared"` (codegen error otherwise)
- Synchronization is implicit — `ensure_ready()` fires automatically at all read sites (`tile_map`, `tile_store`, `tile_matmul`, `tile_extract`, etc.)
- Silent fallback to synchronous load on pre-Ampere GPUs (< sm_80)
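The implicit-sync semantics can be modeled in plain Python. This is an illustrative sketch, not the Warp implementation: a hypothetical `AsyncTile` wrapper starts a background "DMA" task and blocks on its result only at the first read, which is the behavior `ensure_ready()` provides at tile read sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative model only (not the Warp API): a tile whose data is filled
# by a background task; any read blocks until the copy has landed.
class AsyncTile:
    def __init__(self, executor, load_fn):
        self._future = executor.submit(load_fn)  # copy in flight
        self._data = None

    def read(self):
        # First read waits for the copy to complete (implicit sync);
        # later reads return the cached data immediately.
        if self._data is None:
            self._data = self._future.result()
        return self._data

with ThreadPoolExecutor(max_workers=1) as pool:
    tile = AsyncTile(pool, lambda: [1.0, 2.0, 3.0])
    # ... compute on previously loaded tiles here ...
    data = tile.read()  # synchronization happens here, on first read
```

The point of the model: the caller never issues an explicit wait; consuming the tile is the synchronization point.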
Double-buffer example:
```python
@wp.kernel
def double_buffered(arr: wp.array2d(dtype=float), out: wp.array2d(dtype=float)):
    a = wp.tile_load(arr, shape=(1, TILE_N), offset=(0, 0), storage="shared")
    for i in range(1, N_ROWS):
        b = wp.tile_load(arr, shape=(1, TILE_N), offset=(i, 0), storage="shared", async_load=True)
        a = wp.tile_map(compute, a)  # process current tile while next loads
        wp.tile_store(out, a, offset=(i - 1, 0))
        a = b
    a = wp.tile_map(compute, a)
    wp.tile_store(out, a, offset=(N_ROWS - 1, 0))
```

Context
Serial tile-processing workloads (iterating over rows, streaming reductions, etc.) are bottlenecked by load latency — each tile_load stalls the thread block until data arrives in shared memory. CUDA's cp.async instruction programs a hardware DMA copy that bypasses the register file, freeing threads to execute ALU work concurrently. On an RTX 5090 benchmark (256×256 tiles, sin(exp(sqrt(x))) compute), this yields a 1.21x speedup over synchronous loads.
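The overlap pattern itself is independent of CUDA. Here is a plain-Python sketch of the same double-buffering structure, where a worker thread stands in for the cp.async DMA engine: it fetches the next row while the main thread computes on the current one. All names here (`process_rows`, the toy data) are illustrative, not part of Warp.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of double buffering: a worker thread plays the DMA engine,
# copying the next row while the main thread processes the current one.
def process_rows(rows, compute):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        current = rows[0]  # synchronous first load (prologue)
        for i in range(1, len(rows)):
            # "async load" of the next tile, running in the background
            next_future = dma.submit(lambda r=rows[i]: list(r))
            results.append(compute(current))  # overlap compute with the copy
            current = next_future.result()    # implicit sync before reuse
        results.append(compute(current))      # epilogue: drain the last tile
    return results

out = process_rows([[1, 2], [3, 4]], lambda row: [x * 2 for x in row])
```

Note the same prologue/steady-state/epilogue shape as the kernel above: one synchronous load up front, one trailing compute-and-store after the loop.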