ImplicitMPM - GPU memory keeps growing per step #1042

@Cucchi01

Description

After several minutes of training, the GPU memory usage continuously increases until all available memory is exhausted.

It is unclear to us (the team working on this at the ETH RSL Lab) why the memory grows gradually instead of failing immediately with an out-of-memory error.
To help reproduce the issue, I have created a minimal example that replicates the behavior:
example_mpm_memory_issue.py

Environment

  • Script: newton/newton/examples/mpm/example_mpm_memory_issue.py (created and shared in this issue)

  • GPU: NVIDIA RTX 3060 (12 GiB)

  • Additional deps (installed via uv add):

    • matplotlib
    • nvidia-ml-py
    • psutil
  • nvidia-smi

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.163.01             Driver Version: 550.163.01     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   [REDACTED]  On |                  N/A |
|  0%   53C    P8             17W /  170W |     660MiB /  12288MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   [REDACTED]      G   /usr/lib/xorg/Xorg                            182MiB |
|    0   N/A  N/A   [REDACTED]    C+G   ...libexec/gnome-remote-desktop-daemon        104MiB |
|    0   N/A  N/A   [REDACTED]      G   /usr/bin/gnome-shell                           97MiB |
|    0   N/A  N/A   [REDACTED]      G   ...irefox/7177/usr/lib/firefox/firefox        139MiB |
|    0   N/A  N/A   [REDACTED]      G   /usr/share/code/code                           99MiB |
|    0   N/A  N/A   [REDACTED]      G   /usr/bin/gnome-control-center                  11MiB |
|    0   N/A  N/A   [REDACTED]      G   /usr/bin/gjs-console                            4MiB |
+-----------------------------------------------------------------------------------------+

How to reproduce

uv run newton/newton/examples/mpm/example_mpm_memory_issue.py          # headless
uv run newton/newton/examples/mpm/example_mpm_memory_issue.py --viewer # optional visualization

Both commands run the attached example_mpm_memory_issue.py.
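For reference, the per-step cpu=/gpu= readings under Observations can be collected with the psutil and nvidia-ml-py packages listed above. A minimal sketch (the helper name and log format are ours; note that NVML reports device-wide usage, so the baseline from Xorg and other processes is included):

import os

import psutil
import pynvml

pynvml.nvmlInit()
_gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, the RTX 3060
_proc = psutil.Process(os.getpid())

def log_memory(step: int) -> None:
    # Host memory: resident set size of this process only.
    cpu_mib = _proc.memory_info().rss / 1024**2
    # Device memory: NVML reports usage for the whole GPU, not just this process.
    gpu_mib = pynvml.nvmlDeviceGetMemoryInfo(_gpu).used / 1024**2
    print(f"step {step} cpu= {cpu_mib:.2f} MiB gpu= {gpu_mib:.2f} MiB")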

Observations

step 1 cpu= 760.60 MiB gpu= 1278.47 MiB
step 2 cpu= 762.60 MiB gpu= 1502.47 MiB
step 3 cpu= 762.72 MiB gpu= 1630.47 MiB
step 4 cpu= 762.72 MiB gpu= 1694.47 MiB
step 5 cpu= 762.72 MiB gpu= 1726.47 MiB
step 6 cpu= 762.72 MiB gpu= 1822.47 MiB
step 7 cpu= 762.85 MiB gpu= 1920.47 MiB
step 8 cpu= 762.85 MiB gpu= 2016.47 MiB

.......
step 892 cpu= 766.60 MiB gpu= 10880.47 MiB
step 893 cpu= 766.60 MiB gpu= 10912.47 MiB
step 894 cpu= 766.60 MiB gpu= 10880.47 MiB
step 895 cpu= 766.60 MiB gpu= 10880.47 MiB
step 896 cpu= 766.60 MiB gpu= 10912.47 MiB
Warp CUDA error 2: out of memory (in function wp_alloc_device_async, /builds/omniverse/warp/warp/native/warp.cu:657)
Traceback (most recent call last):
  File "....../example_mpm_memory_issue.py", line 163, in <module>
    main()
  File "....../example_mpm_memory_issue.py", line 141, in main
    solver.step(state_0, state_1, None, None, DT)
  File "....../newton/newton/_src/solvers/implicit_mpm/solver_implicit_mpm.py", line 1666, in step
    self._step_impl(state_in, state_out, dt, self._scratchpad)
  File "....../newton/newton/_src/solvers/implicit_mpm/solver_implicit_mpm.py", line 2326, in _step_impl
    "u": scratch.velocity_trial,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/integrate.py", line 1805, in integrate
    return _launch_integrate_kernel(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/integrate.py", line 1476, in _launch_integrate_kernel
    triplet_values_temp = cache.borrow_temporary(
                          ^^^^^^^^^^^^^^^^^^^^^^^
  File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/fem/cache.py", line 619, in borrow_temporary
    Temporary(shape=shape, dtype=dtype, pinned=pinned, device=device, requires_grad=requires_grad)
  File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/types.py", line 2434, in __init__
    self._init_new(dtype, shape, strides, device, pinned)
  File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/types.py", line 2829, in _init_new
    ptr = allocator.alloc(capacity)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "....../newton/.venv/lib/python3.11/site-packages/warp/_src/context.py", line 2706, in alloc
    raise RuntimeError(f"Failed to allocate {size_in_bytes} bytes on device '{self.device}'")
RuntimeError: Failed to allocate 147456000 bytes on device 'cuda:0'

Expectations

With a fixed world/particle count, successive timesteps should reuse GPU buffers so memory stays roughly flat.
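As an illustration of that expectation, here is a hedged sketch in plain Warp (not the solver's actual code) of the reuse pattern we assumed: working buffers allocated once before the loop and cleared in place each step, so per-step GPU memory stays roughly flat.

import warp as wp

wp.init()

NUM_PARTICLES = 4096  # fixed particle count, as in the example

# Allocate working buffers once, outside the simulation loop.
velocities = wp.zeros(NUM_PARTICLES, dtype=wp.vec3, device="cuda:0")

for step in range(1000):
    # Reuse the same buffer every step; zeroing in place performs no new
    # allocation, so device memory should stay roughly constant across steps.
    velocities.zero_()
    # solver.step(...) would read/write these persistent buffers here.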

Actual

Tested with both the fixed and the sparse grid_type; in both cases the program eventually crashes with the out-of-memory error shown above.

Notes

With PARTICLES_PER_WORLD=4 and grid_type set to sparse, memory usage increases up to a certain point, drops, and then starts increasing again. In contrast, with grid_type=fixed, memory usage remains constant after an initial increase.

[Attached: plots of per-step memory usage for the sparse and fixed grid_type configurations]
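For completeness, curves like these can be regenerated from the per-step log using the matplotlib dependency listed in the environment; a minimal sketch (the log file name is hypothetical, and the parsing assumes the exact format shown under Observations):

import re

import matplotlib.pyplot as plt

steps, gpu_mib = [], []
with open("memory_log.txt") as f:  # hypothetical capture of the stdout shown above
    for line in f:
        m = re.match(r"step (\d+) cpu= *([\d.]+) MiB gpu= *([\d.]+) MiB", line)
        if m:
            steps.append(int(m.group(1)))
            gpu_mib.append(float(m.group(3)))

plt.plot(steps, gpu_mib)
plt.xlabel("step")
plt.ylabel("GPU memory (MiB)")
plt.savefig("gpu_memory_per_step.png")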
