What would you like to report?
Dear fairchem team,
I am running MD simulations in ASE with the "uma-s-1.pt" model for a system of about 70,000 atoms. The simulation is slow, but it runs successfully. However, when I try to accelerate it using LAMMPS + fairchem on a single GPU (H100, 98 GB) via the command:

```
lmp_fc lmp_in=in.lammps task_name=omol local_predict_unit.path.model_name=uma-s-1
```
I encounter the following error:
```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.20 GiB.
GPU 0 has a total capacity of 93.10 GiB of which 18.03 GiB is free.
Including non-PyTorch memory, this process has 75.06 GiB memory in use.
Of the allocated memory 56.39 GiB is allocated by PyTorch,
and 18.01 GiB is reserved by PyTorch but unallocated.
```
Despite this error, the LAMMPS integrator keeps advancing the MD simulation, and I still obtain trajectory files. I suspect the resulting trajectory is non-physical, because the UMA predictor clearly fails with the CUDA OOM and therefore cannot be returning valid energies and forces.
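For reference, the ASE run that does complete follows roughly the pattern below. This is only a minimal sketch based on the fairchem UMA quickstart; the structure file name, MD parameters, and the `predictor`/`calc` names are placeholders, and my actual script differs in the system setup.

```python
from ase import units
from ase.io import read
from ase.md.langevin import Langevin

from fairchem.core import pretrained_mlip, FAIRChemCalculator

# Load the pretrained UMA model and wrap it as an ASE calculator,
# using the same task as in the LAMMPS run above.
predictor = pretrained_mlip.get_predict_unit("uma-s-1", device="cuda")
calc = FAIRChemCalculator(predictor, task_name="omol")

# ~70,000-atom system read from file (placeholder filename).
atoms = read("system.xyz")
atoms.calc = calc

# Simple Langevin MD; slow, but it runs without OOM on the same GPU.
dyn = Langevin(atoms, timestep=1.0 * units.fs,
               temperature_K=300.0, friction=0.01 / units.fs)
dyn.run(1000)
```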
My questions are therefore:
- Is this behavior normal?
- Is it expected that LAMMPS + fairchem can consume more GPU memory than ASE for large systems?
- Are there any recommended strategies to reduce GPU memory usage or otherwise improve performance in this setup?
I would be extremely grateful for any insights or suggestions you could provide.
Below is my LAMMPS input file:
```
# Units and dimensions
units           metal
dimension       3
boundary        p p p
atom_style      atomic
atom_modify     map yes
newton          on

read_data       npt_1bar.data
velocity        all create 300.0 42 mom yes rot yes dist gaussian

# Neighbor list settings
neighbor        2.0 bin
neigh_modify    delay 0 every 10 check yes

# Time step
timestep        0.001

thermo_style    custom step temp pe ke etotal press vol lx ly lz
thermo_modify   flush yes
thermo          1
dump            1 all custom 1 traj_npt.lammpstrj id type x y z vx vy vz
dump_modify     1 sort id

fix             1 all npt temp 300.0 300.0 0.1 iso 1.0 1.0 1.0 tchain 3 pchain 3

restart         1000 restart.*.eq
reset_timestep  0
run             100000
```