bsr_axpy: unconditional reallocation defeats pre-allocated sparsity patterns #1337
Description
Environment
- warp 1.12.0 (pixi, CUDA 13.1)
- RTX A6000 (48 GB), RTX 3090 (24 GB)
- AdaptiveNanogrid, Q1 vec3 space, hex8 elements
Context
Follow-up to #1333. Batched integration (splitting elements into Subdomain batches, accumulating via output=K, add=True) successfully caps COO triplet memory. However, the bsr_axpy call made internally by _launch_integrate_kernel at integrate.py:1657 (output += bsr_result) unconditionally reallocates, which defeats the pre-allocated sparsity pattern.
Problem
bsr_axpy (sparse.py:1649) computes:

```python
sum_nnz = x_nnz + y_nnz
_bsr_ensure_fits(y, nnz=sum_nnz)
```

This is an upper bound that assumes zero overlap between the two sparsity patterns. When the output BSR already contains the full sparsity pattern (e.g., pre-allocated via a cheap symbolic pass), sum_nnz = 2 × actual_nnz. At scale (6M hex8 elements, vec3 elasticity, mat33 blocks), actual_nnz ≈ 370M, so bsr_axpy tries to allocate 740M × 36 bytes ≈ 26.6 GB, which OOMs on 48 GB cards where the actual matrix (13.3 GB) fits.
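The arithmetic behind these figures can be checked with a short sketch (values taken from this report; a mat33 of float32 is 9 × 4 = 36 bytes per block):

```python
# Back-of-envelope check of the allocation sizes quoted above.
# Assumptions: mat33 blocks of float32 (9 * 4 = 36 bytes each) and
# actual_nnz ~= 370M blocks, as in the 6M-element vec3 case.
BYTES_PER_BLOCK = 9 * 4          # mat33, float32

actual_nnz = 370_000_000         # blocks already present in the output K
sum_nnz = 2 * actual_nnz         # bsr_axpy's zero-overlap upper bound

actual_gb = actual_nnz * BYTES_PER_BLOCK / 1e9
alloc_gb = sum_nnz * BYTES_PER_BLOCK / 1e9

print(f"matrix values: {actual_gb:.1f} GB")  # ~13.3 GB, fits on 48 GB
print(f"axpy request:  {alloc_gb:.1f} GB")   # ~26.6 GB, triggers OOM
```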
Reproduction
```python
# Pre-allocate BSR with full sparsity (via scalar integration)
K = precompute_bsr_pattern(space, domain, block_type, device)
K.values.zero_()

# Batched integration: each batch adds into the pre-allocated K
for batch in element_batches:
    fem.integrate(integrand, fields=..., output=K, add=True, ...)
    # ^ internally: bsr_result = bsr_zeros(...); bsr_set_from_triplets(bsr_result, ...); K += bsr_result
    # The K += bsr_result triggers _bsr_ensure_fits(K, nnz=K.nnz + bsr_result.nnz),
    # even though K already has room for all entries
```

Expected behavior
When y.values.size >= sum_nnz (or better: when the actual merged nnz fits), bsr_axpy should skip the reallocation. The check at sparse.py:402-404:

```python
if bsr.values.size < nnz:
    bsr.values = wp.empty(...)
```

would already handle this if sum_nnz were the actual merged nnz instead of the upper bound. But sum_nnz = x_nnz + y_nnz overestimates whenever the two patterns overlap.
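To make the overestimate concrete, here is a small illustration in plain Python (hypothetical 2×2 block patterns, not warp's internal representation): when the incoming pattern is a subset of the output's, the merged nnz equals the output's nnz, while x_nnz + y_nnz counts every overlapping block twice.

```python
# Why sum_nnz = x_nnz + y_nnz overestimates when patterns overlap.
# Patterns modeled as sets of (block_row, block_col) coordinates.
y_pattern = {(0, 0), (0, 1), (1, 0), (1, 1)}   # pre-allocated full pattern
x_pattern = {(0, 0), (1, 1)}                   # one batch's contribution

merged_nnz = len(x_pattern | y_pattern)        # true nnz after the merge
upper_bound = len(x_pattern) + len(y_pattern)  # what bsr_axpy allocates for

print(merged_nnz, upper_bound)  # 4 vs 6: the bound over-allocates by 50%
```

In the fully-overlapping case described in this issue (x's pattern contained in y's), the gap reaches its worst case: the bound is exactly twice the true nnz.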
Suggested fix
Option A (minimal): check y.values.size before computing sum_nnz:

```python
if y.values.size >= x_nnz + y_nnz:
    sum_nnz = y.nnz  # y already has room, skip realloc
else:
    sum_nnz = x_nnz + y_nnz
```

Option B (correct): compute the actual merged nnz via a symbolic merge before allocating. bsr_mm_count_coeffs or a dedicated bsr_axpy_count could determine the true nnz.
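The Option A guard can be sketched as a standalone function; this is hypothetical code, not warp internals. It only assumes y exposes .nnz and .values.size, mirroring warp's BsrMatrix attributes:

```python
# Hypothetical standalone version of the Option A guard (not warp code).
# `y` only needs `.nnz` and `.values.size`, mirroring warp's BsrMatrix.
def axpy_target_nnz(x_nnz: int, y) -> int:
    upper_bound = x_nnz + y.nnz
    if y.values.size >= upper_bound:
        # y's storage already covers the zero-overlap worst case, so the
        # merge cannot outgrow it; keep the current allocation.
        return y.nnz
    return upper_bound
```

Passing the returned value to _bsr_ensure_fits would leave a sufficiently pre-allocated matrix untouched, while the under-allocated case falls back to the existing upper-bound behavior.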
Impact
This is the second half of the memory wall identified in #1333. Batched integration caps COO triplet memory (the first half), but bsr_axpy reallocation during accumulation prevents pre-allocated sparsity from working. Together, these two issues block large-scale adaptive FEM on consumer GPUs (24-48 GB).
With both fixes, a 6M-element vec3 problem (~13 GB BSR) would fit on a 48 GB card. Currently it OOMs even though the final matrix fits.