Skip to content

Support async loading in wp.tile_load() #1330

@daedalus5

Description

@daedalus5

Description

Add an async_load parameter to wp.tile_load() that uses CUDA's cp.async DMA pipeline to overlap global→shared memory tile loads with compute on previously loaded tiles. This enables double-buffering patterns where the hardware DMA engine loads the next tile while threads process the current one.

API:

# Async load — DMA runs in background, sync is automatic on first read
b = wp.tile_load(arr, shape=(1, TILE_N), offset=(i, 0), storage="shared", async_load=True)
  • async_load=True requires storage="shared" (codegen error otherwise)
  • Synchronization is implicit — ensure_ready() fires automatically at all read sites (tile_map, tile_store, tile_matmul, tile_extract, etc.)
  • Silent fallback to synchronous load on pre-Ampere GPUs (< sm_80)

Double-buffer example:

@wp.kernel
def double_buffered(arr: wp.array2d(dtype=float), out: wp.array2d(dtype=float)):
    a = wp.tile_load(arr, shape=(1, TILE_N), offset=(0, 0), storage="shared")
    for i in range(1, N_ROWS):
        b = wp.tile_load(arr, shape=(1, TILE_N), offset=(i, 0), storage="shared", async_load=True)
        a = wp.tile_map(compute, a)  # process current tile while next loads
        wp.tile_store(out, a, offset=(i - 1, 0))
        a = b
    a = wp.tile_map(compute, a)
    wp.tile_store(out, a, offset=(N_ROWS - 1, 0))

Context

Serial tile-processing workloads (iterating over rows, streaming reductions, etc.) are bottlenecked by load latency — each tile_load stalls the thread block until data arrives in shared memory. CUDA's cp.async instruction programs a hardware DMA copy that bypasses the register file, freeing threads to execute ALU work concurrently. On an RTX 5090 benchmark (256×256 tiles, sin(exp(sqrt(x))) compute), this yields a 1.21x speedup over synchronous loads.

Metadata

Metadata

Assignees

Labels

feature requestRequest for something to be added

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions