bsr_axpy: unconditional reallocation defeats pre-allocated sparsity patterns #1337

@jf---

Description

Environment

  • warp 1.12.0 (pixi, CUDA 13.1)
  • RTX A6000 (48 GB), RTX 3090 (24 GB)
  • AdaptiveNanogrid, Q1 vec3 space, hex8 elements

Context

Follow-up to #1333. Batched integration (splitting elements into Subdomain batches, accumulating via output=K, add=True) successfully caps COO triplet memory. However, the bsr_axpy called internally by _launch_integrate_kernel at integrate.py:1657 (output += bsr_result) unconditionally reallocates, negating the pre-allocated sparsity pattern.

Problem

bsr_axpy (sparse.py:1649) computes:

sum_nnz = x_nnz + y_nnz
_bsr_ensure_fits(y, nnz=sum_nnz)

This is an upper bound assuming zero overlap between sparsity patterns. When the output BSR already contains the full sparsity (e.g., pre-allocated via a cheap symbolic pass), sum_nnz = 2 × actual_nnz. At scale (6M hex8 elements, vec3 elasticity, mat33 blocks), actual_nnz ≈ 370M, so bsr_axpy tries to allocate 740M × 36 bytes ≈ 26.6 GB — OOM on 48 GB cards where the actual matrix (13.3 GB) fits.
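The figures above can be checked with standalone arithmetic (plain Python, not warp code; sizes assume float32 mat33 blocks):

```python
# Back-of-envelope check of the memory figures quoted above.
BYTES_PER_BLOCK = 3 * 3 * 4  # mat33 of float32 = 36 bytes per block

actual_nnz = 370_000_000      # pre-allocated pattern already holds every entry
sum_nnz = 2 * actual_nnz      # x_nnz + y_nnz when both patterns are identical

actual_gb = actual_nnz * BYTES_PER_BLOCK / 1e9  # ~13.3 GB: fits on 24-48 GB cards
upper_gb = sum_nnz * BYTES_PER_BLOCK / 1e9      # ~26.6 GB: what bsr_axpy allocates
```

The upper bound is exactly double the true footprint whenever the output already contains the full pattern.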

Reproduction

# Pre-allocate BSR with full sparsity (via scalar integration)
K = precompute_bsr_pattern(space, domain, block_type, device)
K.values.zero_()

# Batched integration — each batch adds into pre-allocated K
for batch in element_batches:
    fem.integrate(integrand, fields=..., output=K, add=True, ...)
    # ^ internally: bsr_result = bsr_zeros(...); bsr_set_from_triplets(bsr_result, ...); K += bsr_result
    # The K += bsr_result triggers _bsr_ensure_fits(K, nnz=K.nnz + bsr_result.nnz)
    # Even though K already has room for all entries

Expected behavior

When y.values.size >= sum_nnz (or better: when the actual merged nnz fits), bsr_axpy should skip reallocation. The check at sparse.py:402-404:

if bsr.values.size < nnz:
    bsr.values = wp.empty(...)

would handle this if sum_nnz were the actual merged nnz instead of the upper bound. But sum_nnz = x_nnz + y_nnz overestimates when patterns overlap.
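The overestimate is easy to see with a toy example using Python sets of (row, col) block coordinates (an illustration only, not warp's internal representation):

```python
# Two BSR operands with identical sparsity patterns, as sets of block coords.
x_pattern = {(0, 0), (0, 1), (1, 1)}
y_pattern = {(0, 0), (0, 1), (1, 1)}  # pre-allocated output, same pattern

sum_nnz = len(x_pattern) + len(y_pattern)  # upper bound used by bsr_axpy: 6
merged_nnz = len(x_pattern | y_pattern)    # true nnz of x + y: 3
```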

Suggested fix

Option A (minimal): check y.values.size before computing sum_nnz:

if y.values.size >= x_nnz + y_nnz:
    sum_nnz = y.nnz  # y already has room, skip realloc
else:
    sum_nnz = x_nnz + y_nnz

Option B (correct): compute actual merged nnz via symbolic merge before allocating. bsr_mm_count_coeffs or a dedicated bsr_axpy_count could determine the true nnz.
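A host-side sketch of what such a count could look like, packing each (row, col) pair into a single int64 key and counting unique keys across both operands. The name bsr_axpy_count and the packing scheme are illustrative only; a real implementation would run on device, analogous to bsr_mm_count_coeffs:

```python
import numpy as np

def bsr_axpy_count(x_rows, x_cols, y_rows, y_cols):
    """True merged nnz of x + y: size of the union of block coordinates.

    Assumes block indices fit in 32 bits so (row, col) can be packed
    into one int64 key.
    """
    x_keys = (np.asarray(x_rows, dtype=np.int64) << 32) | np.asarray(x_cols, dtype=np.int64)
    y_keys = (np.asarray(y_rows, dtype=np.int64) << 32) | np.asarray(y_cols, dtype=np.int64)
    # union1d sorts and deduplicates across both operands
    return int(np.union1d(x_keys, y_keys).size)
```

With identical patterns this returns actual_nnz rather than 2 × actual_nnz, so the existing check in _bsr_ensure_fits would see a target that already fits the pre-allocated storage and skip reallocation.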

Impact

This is the second half of the memory wall identified in #1333. Batched integration caps COO triplet memory (the first half), but bsr_axpy reallocation during accumulation prevents pre-allocated sparsity from working. Together, these two issues block large-scale adaptive FEM on consumer GPUs (24-48 GB).

With both fixes, a 6M-element vec3 problem (~13 GB BSR) would fit on a 48 GB card. Currently it OOMs even though the final matrix fits.
