bsr_axpy: unconditional reallocation defeats pre-allocated sparsity patterns #1337
Description
Environment
- warp 1.12.0 (pixi, CUDA 13.1)
- RTX A6000 (48 GB), RTX 3090 (24 GB)
- AdaptiveNanogrid, Q1 vec3 space, hex8 elements
Context
Follow-up to #1333. Batched integration (splitting elements into Subdomain batches, accumulating via output=K, add=True) successfully caps COO triplet memory. However, the bsr_axpy call made internally by _launch_integrate_kernel at integrate.py:1657 (output += bsr_result) unconditionally reallocates, which defeats the pre-allocated sparsity pattern.
Problem
bsr_axpy (sparse.py:1649) computes:

```python
sum_nnz = x_nnz + y_nnz
_bsr_ensure_fits(y, nnz=sum_nnz)
```

This is an upper bound that assumes zero overlap between the two sparsity patterns. When the output BSR already contains the full sparsity pattern (e.g., pre-allocated via a cheap symbolic pass), sum_nnz = 2 × actual_nnz. At scale (6M hex8 elements, vec3 elasticity, mat33 blocks), actual_nnz ≈ 370M, so bsr_axpy tries to allocate 740M × 36 bytes ≈ 26.6 GB, which OOMs on 48 GB cards where the actual matrix (13.3 GB) fits.
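The arithmetic behind these figures can be checked with a short sketch (values taken from this report; a mat33 of float32 is 9 × 4 = 36 bytes per block):

```python
# Back-of-envelope check of the allocation sizes quoted above.
# Assumptions: mat33 blocks of float32 (9 * 4 = 36 bytes each) and
# actual_nnz ~= 370M blocks, as in the 6M-element vec3 case.
BYTES_PER_BLOCK = 9 * 4          # mat33, float32

actual_nnz = 370_000_000         # blocks already present in the output K
sum_nnz = 2 * actual_nnz         # bsr_axpy's zero-overlap upper bound

actual_gb = actual_nnz * BYTES_PER_BLOCK / 1e9
alloc_gb = sum_nnz * BYTES_PER_BLOCK / 1e9

print(f"matrix values: {actual_gb:.1f} GB")  # ~13.3 GB, fits on 48 GB
print(f"axpy request:  {alloc_gb:.1f} GB")   # ~26.6 GB, triggers OOM
```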
Reproduction
```python
# Pre-allocate BSR with full sparsity (via scalar integration)
K = precompute_bsr_pattern(space, domain, block_type, device)
K.values.zero_()

# Batched integration: each batch adds into the pre-allocated K
for batch in element_batches:
    fem.integrate(integrand, fields=..., output=K, add=True, ...)
    # ^ internally: bsr_result = bsr_zeros(...); bsr_set_from_triplets(bsr_result, ...); K += bsr_result
    # The K += bsr_result triggers _bsr_ensure_fits(K, nnz=K.nnz + bsr_result.nnz),
    # even though K already has room for all entries
```

Expected behavior
When y.values.size >= sum_nnz (or better: when the actual merged nnz fits), bsr_axpy should skip the reallocation. The check at sparse.py:402-404:

```python
if bsr.values.size < nnz:
    bsr.values = wp.empty(...)
```

would already handle this if sum_nnz were the actual merged nnz instead of the upper bound. But sum_nnz = x_nnz + y_nnz overestimates whenever the two patterns overlap.
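To make the overestimate concrete, here is a small illustration in plain Python (hypothetical 2×2 block patterns, not warp's internal representation): when the incoming pattern is a subset of the output's, the merged nnz equals the output's nnz, while x_nnz + y_nnz counts every overlapping block twice.

```python
# Why sum_nnz = x_nnz + y_nnz overestimates when patterns overlap.
# Patterns modeled as sets of (block_row, block_col) coordinates.
y_pattern = {(0, 0), (0, 1), (1, 0), (1, 1)}   # pre-allocated full pattern
x_pattern = {(0, 0), (1, 1)}                   # one batch's contribution

merged_nnz = len(x_pattern | y_pattern)        # true nnz after the merge
upper_bound = len(x_pattern) + len(y_pattern)  # what bsr_axpy allocates for

print(merged_nnz, upper_bound)  # 4 vs 6: the bound over-allocates by 50%
```

In the fully-overlapping case described in this issue (x's pattern contained in y's), the gap reaches its worst case: the bound is exactly twice the true nnz.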
Suggested fix
Option A (minimal): check y.values.size before computing sum_nnz:

```python
if y.values.size >= x_nnz + y_nnz:
    sum_nnz = y.nnz  # y already has room, skip realloc
else:
    sum_nnz = x_nnz + y_nnz
```

Option B (correct): compute the actual merged nnz via a symbolic merge before allocating. bsr_mm_count_coeffs or a dedicated bsr_axpy_count could determine the true nnz.
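The Option A guard can be sketched as a standalone function; this is hypothetical code, not warp internals. It only assumes y exposes .nnz and .values.size, mirroring warp's BsrMatrix attributes:

```python
# Hypothetical standalone version of the Option A guard (not warp code).
# `y` only needs `.nnz` and `.values.size`, mirroring warp's BsrMatrix.
def axpy_target_nnz(x_nnz: int, y) -> int:
    upper_bound = x_nnz + y.nnz
    if y.values.size >= upper_bound:
        # y's storage already covers the zero-overlap worst case, so the
        # merge cannot outgrow it; keep the current allocation.
        return y.nnz
    return upper_bound
```

Passing the returned value to _bsr_ensure_fits would leave a sufficiently pre-allocated matrix untouched, while the under-allocated case falls back to the existing upper-bound behavior.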
Impact
This is the second half of the memory wall identified in #1333. Batched integration caps COO triplet memory (the first half), but bsr_axpy reallocation during accumulation prevents pre-allocated sparsity from working. Together, these two issues block large-scale adaptive FEM on consumer GPUs (24-48 GB).
With both fixes, a 6M-element vec3 problem (~13 GB BSR) would fit on a 48 GB card. Currently it OOMs even though the final matrix fits.