### Bug summary
dp --pt eval-desc runs out of GPU memory when extracting atomic descriptors with the DPA4-Plus pre-trained model on several larger systems, although the normal dp --pt test inference workflow with the same model/checkpoint completes successfully in the same environment.
The failure is reproducible on a single deepmd/npy/mixed system with 216 atoms and 59 frames. DeePMD auto batch size reduction reaches batch size 1, but the descriptor evaluation still fails with:
deepmd.utils.errors.OutOfMemoryError: The callable still throws an out-of-memory (OOM) error even when batch size is 1!
The CUDA traceback points to the SeZM/SO2 descriptor path:
deepmd/pt/model/descriptor/sezm_nn/so2.py, inside torch.bmm(D_m_prime, x_src).
I would like to confirm whether this is expected behavior for DPA4-Plus eval-desc, or whether eval-desc should support further chunking/streaming to avoid materializing such a large intermediate descriptor computation.
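To make the chunking request concrete, here is a minimal sketch of how a batch axis such as the edge dimension `E` in `torch.bmm(D_m_prime, x_src)` could be split so that each partial output stays under a memory budget. This is not DeePMD-kit code: `tensor_bytes`, `plan_edge_chunks`, the budget, and the example sizes for `E`, `D_m`, and `C_wide` are hypothetical and were not measured on DPA4-Plus.

```python
from math import prod


def tensor_bytes(shape, itemsize=4):
    """Bytes needed for a dense tensor of the given shape (float32 by default)."""
    return prod(shape) * itemsize


def plan_edge_chunks(num_edges, d_m, c_wide, budget_bytes, itemsize=4):
    """Split range(num_edges) into contiguous [start, stop) slices.

    Each slice is sized so that a (stop - start, d_m, c_wide) output fits in
    budget_bytes. At least one edge is placed in every chunk, so a single
    edge larger than the budget still yields a chunk of size 1.
    """
    per_edge = tensor_bytes((d_m, c_wide), itemsize)
    chunk = max(1, budget_bytes // per_edge)
    return [(s, min(s + chunk, num_edges)) for s in range(0, num_edges, chunk)]


# Hypothetical sizes, chosen only for illustration (not taken from the model).
E, D_M, C_WIDE = 100_000, 5, 128
full = tensor_bytes((E, D_M, C_WIDE))
chunks = plan_edge_chunks(E, D_M, C_WIDE, budget_bytes=64 * 1024**2)
print(f"full output: {full / 1024**2:.1f} MiB, split into {len(chunks)} chunks")
```

Inside the descriptor, each `(start, stop)` pair would index the batch axis, for example `torch.bmm(D_m_prime[start:stop], x_src[start:stop])`, since `torch.bmm` treats the first dimension as independent batches. Whether chunking this one call would be enough is an open question: the log shows 30.94 GiB already allocated by PyTorch when the 240 MiB request fails, so other intermediates may need the same treatment.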
### DeePMD-kit Version
DeePMD-kit v0.1.dev1+g27a18b604
### Backend and its version
PyTorch backend: torch 2.11.0+cu126
### How did you download the software?
Built from source
### Input Files, Running Commands, Error Log, etc.
Model:
DPA4-Plus pretrained model, specifically DPA4-Plus-OMat24-16M.pt.
Minimal failing input:
A single DPData system, sampled_dpdata/216, with 216 atoms and 59 frames.
The DPData directory contains:
sampled_dpdata/216/type.raw
sampled_dpdata/216/type_map.raw
sampled_dpdata/216/set.000/box.npy
sampled_dpdata/216/set.000/coord.npy
sampled_dpdata/216/set.000/energy.npy
sampled_dpdata/216/set.000/force.npy
sampled_dpdata/216/set.000/real_atom_types.npy
sampled_dpdata/216/set.000/spin.npy
sampled_dpdata/216/set.000/virial.npy
Failing command:
dp --pt eval-desc \
-s sampled_dpdata/216 \
-m DPA4-Plus-OMat24-16M.pt \
-o desc_216
The original full-run command was:
dp --pt eval-desc \
-s sampled_dpdata \
-m model.ckpt.pt \
-o desc_train
Successful inference comparison:
dp --pt test \
-s other_dpdata \
-m model.ckpt.pt \
-d results
In the same workflow, four dp --pt test Slurm jobs completed successfully, while dp --pt eval-desc failed on the training descriptor job. The failing descriptor job first reached sampled_dpdata/216:
[2026-06-05 03:10:23,331] DEEPMD INFO # processing system : .../sampled_dpdata/216
[2026-06-05 03:10:23,349] DEEPMD INFO # evaluating descriptors for 59 frames
Relevant traceback:
File ".../site-packages/deepmd/pt/model/descriptor/sezm_nn/so2.py", line 1313, in forward
x_local = torch.bmm(D_m_prime, x_src) # (E, D_m, C_wide)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 240.00 MiB.
GPU 0 has a total capacity of 31.73 GiB of which 226.69 MiB is free.
Including non-PyTorch memory, this process has 31.51 GiB memory in use.
Of the allocated memory 30.94 GiB is allocated by PyTorch, and 162.59 MiB is reserved by PyTorch but unallocated.
deepmd.utils.errors.OutOfMemoryError:
The callable still throws an out-of-memory (OOM) error even when batch size is 1!
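A quick arithmetic check on the figures in the OOM message above helps interpret it. The sketch below uses only the numbers reported in the log; the variable names are illustrative.

```python
GIB_IN_MIB = 1024

# Values copied from the CUDA OOM message.
total_gib = 31.73       # "total capacity"
process_gib = 31.51     # "this process has ... memory in use"
allocated_gib = 30.94   # "allocated by PyTorch"
free_mib = 226.69       # "is free"
request_mib = 240.00    # "Tried to allocate"

# Share of the card already held by PyTorch tensors.
allocated_fraction = allocated_gib / total_gib
# Size of the failing request relative to the whole card.
request_fraction = request_mib / (total_gib * GIB_IN_MIB)
# Process usage plus free memory should roughly add up to the capacity.
accounted_gib = process_gib + free_mib / GIB_IN_MIB
# How far the request overshoots the free memory.
shortfall_mib = request_mib - free_mib

print(f"allocated by PyTorch: {allocated_fraction:.1%} of capacity")
print(f"failing request:      {request_fraction:.2%} of capacity")
print(f"process + free:       {accounted_gib:.2f} GiB")
print(f"shortfall:            {shortfall_mib:.2f} MiB")
```

The failing request is under 1% of the card's capacity, while about 97.5% was already allocated by PyTorch before `torch.bmm` ran. The `bmm` output is therefore the allocation that tipped the process over, not necessarily the dominant memory consumer.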
Environment variables used in the Slurm job:
export CUDA_HOME=/opt/devtools/nvidia/cuda-12.6.3
export CUDAToolkit_ROOT=${CUDA_HOME}
export TORCH_CUDA_ARCH_LIST=7.0
export CMAKE_CUDA_ARCHITECTURES=70
export DP_VARIANT=cuda
export DP_ENABLE_PYTORCH=1
export DP_ENABLE_TENSORFLOW=0
export DP_COMPILE_INFER=0
export TORCHDYNAMO_DISABLE=1
export TORCHINDUCTOR_COMPILE_THREADS=1
export DP_INTERFACE_PREC=high
export OMP_NUM_THREADS=2
Additional observation:
A CPU fallback attempt for the same sampled_dpdata/216 system was also killed by Slurm for exceeding its memory limit, so the issue does not appear to be caused by CUDA memory fragmentation alone.
### Steps to Reproduce
1. Prepare a DeePMD-kit environment with the PyTorch backend and CUDA support.
2. Use the DPA4-Plus pretrained model:
`DPA4-Plus-OMat24-16M.pt`.
3. Prepare the attached minimal DPData system:
`sampled_dpdata/216`, containing 216 atoms and 59 frames.
4. Run descriptor extraction:
```bash
dp --pt eval-desc \
-s sampled_dpdata/216 \
-m DPA4-Plus-OMat24-16M.pt \
-o desc_216
```
5. Observe that `dp --pt eval-desc` fails with CUDA OOM inside the SeZM/SO2 descriptor block, even after the auto batch size is reduced to 1.
6. For comparison, run normal inference with the same model/checkpoint:
```bash
dp --pt test \
-s \
-m DPA4-Plus-OMat24-16M.pt \
-d results
```

Expected behavior:
`dp --pt eval-desc` should either complete, expose a way to further chunk descriptor extraction, or clearly document the limitation of DPA4-Plus descriptor extraction on larger systems.

Actual behavior:
`dp --pt eval-desc` fails with OOM at batch size 1.

### Further Information, Files, and Links
issue_dpa4_plus_eval_desc_oom.tar.gz