Description
ImplicitMPM - GPU memory keeps growing per step
After several minutes of training, the GPU memory usage continuously increases until all available memory is exhausted.
It is unclear to us (people working on this at the ETH RSL Lab) why the memory grows gradually instead of triggering an immediate out-of-memory error.
To help reproduce, I have created a minimal example that replicates the behavior:
example_mpm_memory_issue.py
Environment
- Script: newton/newton/examples/mpm/example_mpm_memory_issue.py (created and shared in this issue)
- GPU: NVIDIA RTX 3060 (12 GiB)
- Additional deps (installed via uv add):
  - matplotlib
  - nvidia-ml-py
  - psutil
- nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01 Driver Version: 550.163.01 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | [REDACTED] On | N/A |
| 0% 53C P8 17W / 170W | 660MiB / 12288MiB | 5% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A [REDACTED] G /usr/lib/xorg/Xorg 182MiB |
| 0 N/A N/A [REDACTED] C+G ...libexec/gnome-remote-desktop-daemon 104MiB |
| 0 N/A N/A [REDACTED] G /usr/bin/gnome-shell 97MiB |
| 0 N/A N/A [REDACTED] G ...irefox/7177/usr/lib/firefox/firefox 139MiB |
| 0 N/A N/A [REDACTED] G /usr/share/code/code 99MiB |
| 0 N/A N/A [REDACTED] G /usr/bin/gnome-control-center 11MiB |
| 0 N/A N/A [REDACTED] G /usr/bin/gjs-console 4MiB |
+-----------------------------------------------------------------------------------------+
How to reproduce
uv run newton/newton/examples/mpm/example_mpm_memory_issue.py # headless
uv run newton/newton/examples/mpm/example_mpm_memory_issue.py --viewer # optional visualization

Both commands run the example code example_mpm_memory_issue.py shared above.
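The per-step cpu/gpu numbers under Observations were collected roughly as in the sketch below, using the psutil and nvidia-ml-py packages listed above (the helper name log_memory and the exact formatting are illustrative rather than copied verbatim from the script):

```python
import psutil
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
_gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
_proc = psutil.Process()

def log_memory(step: int) -> None:
    # Resident CPU memory of this process.
    cpu_mib = _proc.memory_info().rss / (1024**2)
    # Total memory in use on GPU 0 (includes other processes such as Xorg).
    gpu_mib = pynvml.nvmlDeviceGetMemoryInfo(_gpu).used / (1024**2)
    print(f"step {step} cpu= {cpu_mib:.2f} MiB gpu= {gpu_mib:.2f} MiB")
```

In the example, an equivalent measurement is taken once per solver step.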
Observations
step 1 cpu= 760.60 MiB gpu= 1278.47 MiB
step 2 cpu= 762.60 MiB gpu= 1502.47 MiB
step 3 cpu= 762.72 MiB gpu= 1630.47 MiB
step 4 cpu= 762.72 MiB gpu= 1694.47 MiB
step 5 cpu= 762.72 MiB gpu= 1726.47 MiB
step 6 cpu= 762.72 MiB gpu= 1822.47 MiB
step 7 cpu= 762.85 MiB gpu= 1920.47 MiB
step 8 cpu= 762.85 MiB gpu= 2016.47 MiB
.......
step 892 cpu= 766.60 MiB gpu= 10880.47 MiB
step 893 cpu= 766.60 MiB gpu= 10912.47 MiB
step 894 cpu= 766.60 MiB gpu= 10880.47 MiB
step 895 cpu= 766.60 MiB gpu= 10880.47 MiB
step 896 cpu= 766.60 MiB gpu= 10912.47 MiB
Warp CUDA error 2: out of memory (in function wp_alloc_device_async, /builds/omniverse/warp/warp/native/warp.cu:657)
Traceback (most recent call last):
File "....../example_mpm_memory_issue.py", line 163, in
main()
File "....../example_mpm_memory_issue.py", line 141, in main
solver.step(state_0, state_1, None, None, DT)
File "....../newton/newton/_src/solvers/implicit_mpm/solver_implicit_mpm.py", line 1666, in step
self._step_impl(state_in, state_out, dt, self._scratchpad)
File "....../newton/newton/_src/solvers/implicit_mpm/solver_implicit_mpm.py", line 2326, in _step_impl
"u": scratch.velocity_trial,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/integrate.py", line 1805, in integrate
return _launch_integrate_kernel(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/integrate.py", line 1476, in _launch_integrate_kernel
triplet_values_temp = cache.borrow_temporary(
^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/cache.py", line 619, in borrow_temporary
Temporary(shape=shape, dtype=dtype, pinned=pinned, device=device, requires_grad=requires_grad)
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/types.py", line 2434, in init
self._init_new(dtype, shape, strides, device, pinned)
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/types.py", line 2829, in _init_new
ptr = allocator.alloc(capacity)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/context.py", line 2706, in alloc
raise RuntimeError(f"Failed to allocate {size_in_bytes} bytes on device '{self.device}'")
RuntimeError: Failed to allocate 147456000 bytes on device 'cuda:0'
Expectations
With a fixed world/particle count, successive timesteps should reuse GPU buffers so memory stays roughly flat.
Actual
Tested with both grid_type=fixed and grid_type=sparse; in both cases the program eventually runs out of GPU memory and crashes.
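To narrow down whether the growth comes from live allocations or from Warp's CUDA memory pool caching freed blocks, a diagnostic along these lines could be added to the loop. This assumes the mempool introspection functions shipped with recent Warp releases (wp.is_mempool_supported, wp.get_mempool_used_mem_current, wp.get_mempool_used_mem_high); if your Warp version lacks them, treat this purely as a sketch:

```python
import warp as wp

wp.init()
device = wp.get_device("cuda:0")

def report_pool(step: int) -> None:
    # Stream-ordered memory pool statistics. A "current" value that keeps
    # climbing step after step would point at allocations that are never
    # freed, rather than the pool merely caching previously freed blocks.
    if wp.is_mempool_supported(device):
        cur_mib = wp.get_mempool_used_mem_current(device) / (1024**2)
        high_mib = wp.get_mempool_used_mem_high(device) / (1024**2)
        print(f"step {step} pool current= {cur_mib:.2f} MiB high-water= {high_mib:.2f} MiB")
```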
Notes
With PARTICLES_PER_WORLD=4 and grid_type=sparse, memory usage increases up to a certain point, drops, and then starts increasing again. In contrast, with grid_type=fixed, memory usage remains constant after an initial increase.
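If it helps triage, one experiment would be to cap how much freed memory the CUDA memory pool keeps cached via Warp's release-threshold setting (again assuming a recent Warp; the function name wp.set_mempool_release_threshold comes from Warp's allocator documentation and may not exist in older versions):

```python
import warp as wp

wp.init()
device = wp.get_device("cuda:0")

# Ask the stream-ordered allocator to return cached blocks to the driver once
# the pool holds more than ~2 GiB of freed memory. If the per-step growth seen
# in nvidia-smi disappears with this setting, the pool is caching freed blocks;
# if it persists, live allocations are accumulating (e.g. the temporaries
# borrowed in warp.fem's integrate path shown in the traceback above).
if wp.is_mempool_supported(device):
    wp.set_mempool_release_threshold(device, 2 * 1024**3)
```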
