
ASE can handle systems with up to 70,000 atoms, but fairchem_lammps will report CUDA OOM #1725

@XiaoxianPang


What would you like to report?

Dear fairchem team,

I am running MD simulations in ASE with the “uma-s-1.pt” model on a system of about 70,000 atoms. The simulation is slow, but it runs successfully. However, when I try to accelerate it using LAMMPS + fairchem on a single GPU (H100, 98 GB) with the command:
lmp_fc lmp_in=in.lammps task_name=omol local_predict_unit.path.model_name=uma-s-1

I encounter the following error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.20 GiB.
GPU 0 has a total capacity of 93.10 GiB of which 18.03 GiB is free.
Including non-PyTorch memory, this process has 75.06 GiB memory in use.
Of the allocated memory 56.39 GiB is allocated by PyTorch,
and 18.01 GiB is reserved by PyTorch but unallocated.
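
To make the numbers in this message easier to read, here is a minimal Python sketch that parses it and works out the memory budget. The field names (`requested`, `free`, and so on) and both helper functions are my own, hypothetical choices, not part of PyTorch or fairchem; only the message text comes from the actual error.

```python
import re

# The error text exactly as reported above.
OOM_MESSAGE = """\
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 19.20 GiB.
GPU 0 has a total capacity of 93.10 GiB of which 18.03 GiB is free.
Including non-PyTorch memory, this process has 75.06 GiB memory in use.
Of the allocated memory 56.39 GiB is allocated by PyTorch,
and 18.01 GiB is reserved by PyTorch but unallocated.
"""

# Hypothetical field names mapped to patterns matching the message wording.
PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) GiB",
    "capacity": r"total capacity of ([\d.]+) GiB",
    "free": r"of which ([\d.]+) GiB is free",
    "in_use": r"this process has ([\d.]+) GiB memory in use",
    "allocated": r"([\d.]+) GiB is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) GiB is reserved by PyTorch but unallocated",
}


def parse_oom(message):
    """Extract the GiB figures from a CUDA OOM message into a dict of floats."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"field {name!r} not found in message")
        fields[name] = float(match.group(1))
    return fields


def summarize(fields):
    """Derive the shortfall and the PyTorch / non-PyTorch split (GiB)."""
    pytorch_total = fields["allocated"] + fields["reserved_unallocated"]
    return {
        # How much more memory the failed allocation needed than was free.
        "shortfall": round(fields["requested"] - fields["free"], 2),
        # Everything held by the PyTorch caching allocator.
        "pytorch_total": round(pytorch_total, 2),
        # Memory used by the process outside the PyTorch allocator.
        "non_pytorch": round(fields["in_use"] - pytorch_total, 2),
    }


if __name__ == "__main__":
    for key, value in summarize(parse_oom(OOM_MESSAGE)).items():
        print(f"{key}: {value:.2f} GiB")
```

For the message above, the failed 19.20 GiB request exceeds the free memory by 1.17 GiB, while PyTorch holds 74.40 GiB in total, of which 18.01 GiB is cached but unallocated. A large reserved-but-unallocated figure may point to allocator fragmentation. PyTorch documents the setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` for that case, but I have not tested whether it helps here.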

The LAMMPS integrator nevertheless continues to advance the MD simulation, and I still obtain trajectory files. I suspect the resulting trajectory is non-physical: the UMA predictor clearly fails with CUDA OOM, so it cannot be returning valid energies and forces.
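
One way to check this suspicion is to scan the thermo output for values that look wrong. The sketch below is a heuristic only: I do not know what the integrator receives after a failed prediction (stale values, zeros, or NaN), so it simply flags potential energies that are non-finite or exactly repeated from the previous step. The function names are hypothetical, and the column names `Step` and `PotEng` are assumptions about the log header, so adjust them to match the actual log.

```python
import math


def parse_thermo(log_text, header_first="Step"):
    """Return (columns, rows) for the first thermo block found in log_text.

    The block starts at the first line whose first token is header_first
    and ends at the first line that is not a full row of numbers.
    """
    columns, rows = None, []
    for line in log_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if columns is None:
            if parts[0] == header_first:
                columns = parts
            continue
        if len(parts) != len(columns):
            break  # e.g. the "Loop time of ..." line that ends the block
        try:
            rows.append([float(p) for p in parts])
        except ValueError:
            break
    if columns is None:
        raise ValueError("no thermo header found")
    return columns, rows


def suspicious_steps(columns, rows, key="PotEng", step_key="Step"):
    """Flag steps whose value is non-finite or identical to the previous one."""
    k, s = columns.index(key), columns.index(step_key)
    flagged, previous = [], None
    for row in rows:
        value = row[k]
        if not math.isfinite(value):
            flagged.append((int(row[s]), "non-finite"))
        elif previous is not None and value == previous:
            flagged.append((int(row[s]), "unchanged"))
        previous = value
    return flagged


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    sample = """\
Step Temp PotEng KinEng TotEng
0 300.0 -100.0 5.0 -95.0
1 299.0 -100.0 5.1 -94.9
2 298.5 nan 5.2 nan
3 298.0 -99.5 5.3 -94.2
Loop time of 1.0 on 1 procs
"""
    cols, data = parse_thermo(sample)
    print(suspicious_steps(cols, data))
```

An empty result does not prove the forces are valid. It only means this particular symptom is absent.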

My questions are therefore:

  1. Is this behavior normal, i.e. is LAMMPS expected to keep integrating after the predictor fails with CUDA OOM?
  2. Is it expected that LAMMPS + fairchem can consume more GPU memory than ASE for large systems?
  3. Are there any recommended strategies to reduce GPU memory usage or otherwise improve performance in this setup?

I would be extremely grateful for any insights or suggestions you could provide.

Below is my LAMMPS input file:

# Units and Dimensions

units metal
dimension 3
boundary p p p
atom_style atomic
atom_modify map yes
newton on

read_data npt_1bar.data
velocity all create 300.0 42 mom yes rot yes dist gaussian

# Neighbor List Settings

neighbor 2.0 bin
neigh_modify delay 0 every 10 check yes

# Time Step

timestep 0.001

thermo_style custom step temp pe ke etotal press vol lx ly lz
thermo_modify flush yes
thermo 1

dump 1 all custom 1 traj_npt.lammpstrj id type x y z vx vy vz
dump_modify 1 sort id

fix 1 all npt temp 300.0 300.0 0.1 iso 1.0 1.0 1.0 tchain 3 pchain 3

restart 1000 restart.*.eq
reset_timestep 0
run 100000
