Description
ImplicitMPM - GPU memory keeps growing per step
After several minutes of training, the GPU memory usage continuously increases until all available memory is exhausted.
It is unclear to us (people working on this at the ETH RSL Lab) why the memory grows gradually instead of triggering an immediate out-of-memory error.
To help reproduce, I have created a minimal example that replicates the behavior:
example_mpm_memory_issue.py
Environment
- Script: newton/newton/examples/mpm/example_mpm_memory_issue.py (created and shared in this issue)
- GPU: NVIDIA RTX 3060 (12 GiB)
- Additional deps (installed via uv add):
  - matplotlib
  - nvidia-ml-py
  - psutil
- nvidia-smi:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01 Driver Version: 550.163.01 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3060 Off | [REDACTED] On | N/A |
| 0% 53C P8 17W / 170W | 660MiB / 12288MiB | 5% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A [REDACTED] G /usr/lib/xorg/Xorg 182MiB |
| 0 N/A N/A [REDACTED] C+G ...libexec/gnome-remote-desktop-daemon 104MiB |
| 0 N/A N/A [REDACTED] G /usr/bin/gnome-shell 97MiB |
| 0 N/A N/A [REDACTED] G ...irefox/7177/usr/lib/firefox/firefox 139MiB |
| 0 N/A N/A [REDACTED] G /usr/share/code/code 99MiB |
| 0 N/A N/A [REDACTED] G /usr/bin/gnome-control-center 11MiB |
| 0 N/A N/A [REDACTED] G /usr/bin/gjs-console 4MiB |
+-----------------------------------------------------------------------------------------+
How to reproduce
uv run newton/newton/examples/mpm/example_mpm_memory_issue.py # headless
uv run newton/newton/examples/mpm/example_mpm_memory_issue.py --viewer # optional visualization

Both commands run the example code example_mpm_memory_issue.py shared above.
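The per-step cpu/gpu numbers under Observations were collected roughly as in the sketch below, using the psutil and nvidia-ml-py packages listed above (the helper name log_memory and the exact formatting are illustrative rather than copied verbatim from the script):

```python
import psutil
import pynvml  # provided by the nvidia-ml-py package

pynvml.nvmlInit()
_gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
_proc = psutil.Process()

def log_memory(step: int) -> None:
    # Resident CPU memory of this process.
    cpu_mib = _proc.memory_info().rss / (1024**2)
    # Total memory in use on GPU 0 (includes other processes such as Xorg).
    gpu_mib = pynvml.nvmlDeviceGetMemoryInfo(_gpu).used / (1024**2)
    print(f"step {step} cpu= {cpu_mib:.2f} MiB gpu= {gpu_mib:.2f} MiB")
```

In the example, an equivalent measurement is taken once per solver step.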
Observations
step 1 cpu= 760.60 MiB gpu= 1278.47 MiB
step 2 cpu= 762.60 MiB gpu= 1502.47 MiB
step 3 cpu= 762.72 MiB gpu= 1630.47 MiB
step 4 cpu= 762.72 MiB gpu= 1694.47 MiB
step 5 cpu= 762.72 MiB gpu= 1726.47 MiB
step 6 cpu= 762.72 MiB gpu= 1822.47 MiB
step 7 cpu= 762.85 MiB gpu= 1920.47 MiB
step 8 cpu= 762.85 MiB gpu= 2016.47 MiB
.......
step 892 cpu= 766.60 MiB gpu= 10880.47 MiB
step 893 cpu= 766.60 MiB gpu= 10912.47 MiB
step 894 cpu= 766.60 MiB gpu= 10880.47 MiB
step 895 cpu= 766.60 MiB gpu= 10880.47 MiB
step 896 cpu= 766.60 MiB gpu= 10912.47 MiB
Warp CUDA error 2: out of memory (in function wp_alloc_device_async, /builds/omniverse/warp/warp/native/warp.cu:657)
Traceback (most recent call last):
File "....../example_mpm_memory_issue.py", line 163, in
main()
File "....../example_mpm_memory_issue.py", line 141, in main
solver.step(state_0, state_1, None, None, DT)
File "....../newton/newton/_src/solvers/implicit_mpm/solver_implicit_mpm.py", line 1666, in step
self._step_impl(state_in, state_out, dt, self._scratchpad)
File "....../newton/newton/_src/solvers/implicit_mpm/solver_implicit_mpm.py", line 2326, in _step_impl
"u": scratch.velocity_trial,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/integrate.py", line 1805, in integrate
return _launch_integrate_kernel(
^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/integrate.py", line 1476, in _launch_integrate_kernel
triplet_values_temp = cache.borrow_temporary(
^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/cache.py", line 619, in borrow_temporary
Temporary(shape=shape, dtype=dtype, pinned=pinned, device=device, requires_grad=requires_grad)
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/types.py", line 2434, in init
self._init_new(dtype, shape, strides, device, pinned)
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/types.py", line 2829, in _init_new
ptr = allocator.alloc(capacity)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/context.py", line 2706, in alloc
raise RuntimeError(f"Failed to allocate {size_in_bytes} bytes on device '{self.device}'")
RuntimeError: Failed to allocate 147456000 bytes on device 'cuda:0'
Expectations
With a fixed world/particle count, successive timesteps should reuse GPU buffers so memory stays roughly flat.
Actual
Tested with both grid_type=fixed and grid_type=sparse; in both cases the program eventually runs out of GPU memory and crashes.
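To narrow down whether the growth comes from live allocations or from Warp's CUDA memory pool caching freed blocks, a diagnostic along these lines could be added to the loop. This assumes the mempool introspection functions shipped with recent Warp releases (wp.is_mempool_supported, wp.get_mempool_used_mem_current, wp.get_mempool_used_mem_high); if your Warp version lacks them, treat this purely as a sketch:

```python
import warp as wp

wp.init()
device = wp.get_device("cuda:0")

def report_pool(step: int) -> None:
    # Stream-ordered memory pool statistics. A "current" value that keeps
    # climbing step after step would point at allocations that are never
    # freed, rather than the pool merely caching previously freed blocks.
    if wp.is_mempool_supported(device):
        cur_mib = wp.get_mempool_used_mem_current(device) / (1024**2)
        high_mib = wp.get_mempool_used_mem_high(device) / (1024**2)
        print(f"step {step} pool current= {cur_mib:.2f} MiB high-water= {high_mib:.2f} MiB")
```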
Notes
With PARTICLES_PER_WORLD=4 and grid_type=sparse, memory usage increases up to a certain point, drops, and then starts increasing again. In contrast, with grid_type=fixed, memory usage remains constant after an initial increase.
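If it helps triage, one experiment would be to cap how much freed memory the CUDA memory pool keeps cached via Warp's release-threshold setting (again assuming a recent Warp; the function name wp.set_mempool_release_threshold comes from Warp's allocator documentation and may not exist in older versions):

```python
import warp as wp

wp.init()
device = wp.get_device("cuda:0")

# Ask the stream-ordered allocator to return cached blocks to the driver once
# the pool holds more than ~2 GiB of freed memory. If the per-step growth seen
# in nvidia-smi disappears with this setting, the pool is caching freed blocks;
# if it persists, live allocations are accumulating (e.g. the temporaries
# borrowed in warp.fem's integrate path shown in the traceback above).
if wp.is_mempool_supported(device):
    wp.set_mempool_release_threshold(device, 2 * 1024**3)
```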
