Support async loading in wp.tile_load() #1330
Description
Add an async_load parameter to wp.tile_load() that uses CUDA's cp.async DMA pipeline to overlap global→shared memory tile loads with compute on previously loaded tiles. This enables double-buffering patterns where the hardware DMA engine loads the next tile while threads process the current one.
API:
```python
# Async load — DMA runs in background, sync is automatic on first read
b = wp.tile_load(arr, shape=(1, TILE_N), offset=(i, 0), storage="shared", async_load=True)
```

- `async_load=True` requires `storage="shared"` (codegen error otherwise)
- Synchronization is implicit — `ensure_ready()` fires automatically at all read sites (`tile_map`, `tile_store`, `tile_matmul`, `tile_extract`, etc.)
- Silent fallback to synchronous load on pre-Ampere GPUs (< sm_80)
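The implicit-sync semantics can be modeled in plain Python. This is an illustrative sketch, not the Warp implementation: a hypothetical `AsyncTile` wrapper starts a background "DMA" task and blocks on its result only at the first read, which is the behavior `ensure_ready()` provides at tile read sites.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative model only (not the Warp API): a tile whose data is filled
# by a background task; any read blocks until the copy has landed.
class AsyncTile:
    def __init__(self, executor, load_fn):
        self._future = executor.submit(load_fn)  # copy in flight
        self._data = None

    def read(self):
        # First read waits for the copy to complete (implicit sync);
        # later reads return the cached data immediately.
        if self._data is None:
            self._data = self._future.result()
        return self._data

with ThreadPoolExecutor(max_workers=1) as pool:
    tile = AsyncTile(pool, lambda: [1.0, 2.0, 3.0])
    # ... compute on previously loaded tiles here ...
    data = tile.read()  # synchronization happens here, on first read
```

The point of the model: the caller never issues an explicit wait; consuming the tile is the synchronization point.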
Double-buffer example:
```python
@wp.kernel
def double_buffered(arr: wp.array2d(dtype=float), out: wp.array2d(dtype=float)):
    a = wp.tile_load(arr, shape=(1, TILE_N), offset=(0, 0), storage="shared")
    for i in range(1, N_ROWS):
        b = wp.tile_load(arr, shape=(1, TILE_N), offset=(i, 0), storage="shared", async_load=True)
        a = wp.tile_map(compute, a)  # process current tile while next loads
        wp.tile_store(out, a, offset=(i - 1, 0))
        a = b
    a = wp.tile_map(compute, a)
    wp.tile_store(out, a, offset=(N_ROWS - 1, 0))
```

Context
Serial tile-processing workloads (iterating over rows, streaming reductions, etc.) are bottlenecked by load latency — each tile_load stalls the thread block until data arrives in shared memory. CUDA's cp.async instruction programs a hardware DMA copy that bypasses the register file, freeing threads to execute ALU work concurrently. On an RTX 5090 benchmark (256×256 tiles, sin(exp(sqrt(x))) compute), this yields a 1.21x speedup over synchronous loads.
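The overlap pattern itself is independent of CUDA. Here is a plain-Python sketch of the same double-buffering structure, where a worker thread stands in for the cp.async DMA engine: it fetches the next row while the main thread computes on the current one. All names here (`process_rows`, the toy data) are illustrative, not part of Warp.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy model of double buffering: a worker thread plays the DMA engine,
# copying the next row while the main thread processes the current one.
def process_rows(rows, compute):
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        current = rows[0]  # synchronous first load (prologue)
        for i in range(1, len(rows)):
            # "async load" of the next tile, running in the background
            next_future = dma.submit(lambda r=rows[i]: list(r))
            results.append(compute(current))  # overlap compute with the copy
            current = next_future.result()    # implicit sync before reuse
        results.append(compute(current))      # epilogue: drain the last tile
    return results

out = process_rows([[1, 2], [3, 4]], lambda row: [x * 2 for x in row])
```

Note the same prologue/steady-state/epilogue shape as the kernel above: one synchronous load up front, one trailing compute-and-store after the loop.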