[BUG] dp --pt eval-desc OOMs for DPA4-Plus descriptors while dp --pt test inference completes #5507

Description

@QuantumMisaka

Bug summary

dp --pt eval-desc runs out of GPU memory when extracting atomic descriptors with the DPA4-Plus pre-trained model on several larger systems, although the normal dp --pt test inference workflow with the same model/checkpoint completes successfully in the same environment.

The failure is reproducible on a single deepmd/npy/mixed system with 216 atoms and 59 frames. DeePMD auto batch size reduction reaches batch size 1, but the descriptor evaluation still fails with:

deepmd.utils.errors.OutOfMemoryError: The callable still throws an out-of-memory (OOM) error even when batch size is 1!

The CUDA traceback points to the SeZM/SO2 descriptor path:

deepmd/pt/model/descriptor/sezm_nn/so2.py, inside torch.bmm(D_m_prime, x_src).

I would like to confirm whether this is expected behavior for DPA4-Plus eval-desc, or whether eval-desc should support further chunking/streaming to avoid materializing such a large intermediate descriptor computation.
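To make the chunking question concrete, here is a minimal sketch of what streaming over the edge dimension could look like. It is not how `so2.py` is implemented. It uses NumPy rather than PyTorch so that it runs anywhere. The function name, the argument names (`rot`, `dst`, `chunk_size`), the tensor shapes, and the assumption that the per-edge product is subsequently summed onto destination atoms are all hypothetical. The idea is only that the full `(E, D_m, C_wide)` intermediate never has to exist at once if each chunk is reduced right after it is computed.

```python
import numpy as np


def aggregate_edge_messages_chunked(rot, x_src, dst, n_atoms, chunk_size=4096):
    """Sum per-edge products rot[e] @ x_src[e] onto destination atoms, chunk by chunk.

    rot:   (E, M, M) per-edge matrices (hypothetical stand-in for D_m_prime)
    x_src: (E, M, C) per-edge source features
    dst:   (E,) integer index of the destination atom of each edge
    """
    n_edges, m, _ = rot.shape
    out = np.zeros((n_atoms, m, x_src.shape[2]), dtype=x_src.dtype)
    for start in range(0, n_edges, chunk_size):
        stop = min(start + chunk_size, n_edges)
        # Batched matmul on one slice only: peak extra memory is (chunk, M, C).
        local = np.matmul(rot[start:stop], x_src[start:stop])
        # Unbuffered scatter-add, so repeated destination indices accumulate.
        np.add.at(out, dst[start:stop], local)
    return out
```

With this pattern the peak size of the intermediate scales with `chunk_size` instead of the total number of edges. Whether the SeZM/SO2 block can be restructured this way depends on what consumes `x_local`, which this report does not establish.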

DeePMD-kit Version

DeePMD-kit v0.1.dev1+g27a18b604

Backend and its version

PyTorch backend: torch 2.11.0+cu126

How did you download the software?

Built from source

Input Files, Running Commands, Error Log, etc.

Model:
DPA4-Plus pretrained model, specifically DPA4-Plus-OMat24-16M.pt.

Minimal failing input:
A single DPData system, sampled_dpdata/216, with 216 atoms and 59 frames.

The DPData directory contains:

sampled_dpdata/216/type.raw
sampled_dpdata/216/type_map.raw
sampled_dpdata/216/set.000/box.npy
sampled_dpdata/216/set.000/coord.npy
sampled_dpdata/216/set.000/energy.npy
sampled_dpdata/216/set.000/force.npy
sampled_dpdata/216/set.000/real_atom_types.npy
sampled_dpdata/216/set.000/spin.npy
sampled_dpdata/216/set.000/virial.npy

Failing command:

dp --pt eval-desc \
  -s sampled_dpdata/216 \
  -m DPA4-Plus-OMat24-16M.pt \
  -o desc_216

The original full-run command was:

dp --pt eval-desc \
  -s sampled_dpdata \
  -m model.ckpt.pt \
  -o desc_train

Successful inference comparison:

dp --pt test \
  -s other_dpdata \
  -m model.ckpt.pt \
  -d results

In the same workflow, four dp --pt test Slurm jobs completed successfully, while dp --pt eval-desc failed on the
training descriptor job. The failing descriptor job first reached sampled_dpdata/216:

[2026-06-05 03:10:23,331] DEEPMD INFO    # processing system : .../sampled_dpdata/216
[2026-06-05 03:10:23,349] DEEPMD INFO    # evaluating descriptors for 59 frames

Relevant traceback:

File ".../site-packages/deepmd/pt/model/descriptor/sezm_nn/so2.py", line 1313, in forward
    x_local = torch.bmm(D_m_prime, x_src)  # (E, D_m, C_wide)

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 240.00 MiB.
GPU 0 has a total capacity of 31.73 GiB of which 226.69 MiB is free.
Including non-PyTorch memory, this process has 31.51 GiB memory in use.
Of the allocated memory 30.94 GiB is allocated by PyTorch, and 162.59 MiB is reserved by PyTorch but unallocated.

deepmd.utils.errors.OutOfMemoryError:
The callable still throws an out-of-memory (OOM) error even when batch size is 1!
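As a rough way to read these numbers: the failed 240.00 MiB request is under 1% of the 30.94 GiB already allocated by PyTorch, so the `torch.bmm` call is where the budget ran out rather than necessarily the largest consumer. The sketch below converts the request size into an element count, assuming the allocation is the `(E, D_m, C_wide)` output tensor. That is an assumption, and the model's working precision is not stated in this report, so both common element sizes are shown. The helper name is hypothetical.

```python
MIB = 1024 ** 2


def elements_in_allocation(size_mib, bytes_per_element):
    """Number of elements that fit in an allocation of size_mib MiB."""
    return int(size_mib * MIB) // bytes_per_element


# E * D_m * C_wide implied by the 240 MiB request, per assumed dtype.
print(elements_in_allocation(240, 4))  # float32: 62914560
print(elements_in_allocation(240, 8))  # float64: 31457280
```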

Environment variables used in the Slurm job:

export CUDA_HOME=/opt/devtools/nvidia/cuda-12.6.3
export CUDAToolkit_ROOT=${CUDA_HOME}
export TORCH_CUDA_ARCH_LIST=7.0
export CMAKE_CUDA_ARCHITECTURES=70
export DP_VARIANT=cuda
export DP_ENABLE_PYTORCH=1
export DP_ENABLE_TENSORFLOW=0
export DP_COMPILE_INFER=0
export TORCHDYNAMO_DISABLE=1
export TORCHINDUCTOR_COMPILE_THREADS=1
export DP_INTERFACE_PREC=high
export OMP_NUM_THREADS=2

Additional observation:
A CPU fallback attempt for the same sampled_dpdata/216 system was also killed by Slurm due to memory usage, so the issue does not appear to be only CUDA memory fragmentation.

### Steps to Reproduce

1. Prepare a DeePMD-kit environment with the PyTorch backend and CUDA support.

2. Use the DPA4-Plus pretrained model:
   `DPA4-Plus-OMat24-16M.pt`.

3. Prepare the attached minimal DPData system:
   `sampled_dpdata/216`, containing 216 atoms and 59 frames.

4. Run descriptor extraction:

```bash
dp --pt eval-desc \
  -s sampled_dpdata/216 \
  -m DPA4-Plus-OMat24-16M.pt \
  -o desc_216
```

5. Observe that dp --pt eval-desc fails with CUDA OOM inside the SeZM/SO2 descriptor block, even after auto batch size is reduced to 1.

6. For comparison, run normal inference with the same model/checkpoint:

```bash
dp --pt test \
  -s <system> \
  -m DPA4-Plus-OMat24-16M.pt \
  -d results
```

Expected behavior:
dp --pt eval-desc should either complete, expose a way to further chunk descriptor extraction, or provide a clearer
documented limitation for DPA4-Plus descriptor extraction on larger systems.

Actual behavior:
dp --pt eval-desc fails with OOM at batch size 1.

Further Information, Files, and Links

issue_dpa4_plus_eval_desc_oom.tar.gz
