[BUG] `dp --pt eval-desc` OOMs for DPA4-Plus descriptors while `dp --pt test` inference completes #5507

### Bug summary

`dp --pt eval-desc` runs out of GPU memory when extracting atomic descriptors with the DPA4-Plus pretrained model on several larger systems, even though the normal `dp --pt test` inference workflow completes successfully with the same model/checkpoint in the same environment.

The failure is reproducible on a single deepmd/npy/mixed system with 216 atoms and 59 frames. DeePMD-kit's automatic batch size reduction reaches batch size 1, but descriptor evaluation still fails with:

`deepmd.utils.errors.OutOfMemoryError: The callable still throws an out-of-memory (OOM) error even when batch size is 1!`

The CUDA traceback points to the SeZM/SO2 descriptor path:

`deepmd/pt/model/descriptor/sezm_nn/so2.py`, inside `torch.bmm(D_m_prime, x_src)`.

I would like to confirm whether this is expected behavior for DPA4-Plus `eval-desc`, or whether `eval-desc` should support further chunking/streaming to avoid materializing such large intermediates during descriptor computation.
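For scale, the size of a dense tensor shaped like the `(E, D_m, C_wide)` result annotated at the failing `torch.bmm` call can be estimated with plain arithmetic. The sketch below is illustrative only: `tensor_mib` and `elements_for` are hypothetical helpers, and the element sizes (4 bytes for float32, 8 bytes for float64) are assumptions, since this report does not state the model's working precision or the actual values of `E`, `D_m`, and `C_wide`.

```python
MIB = 1024 ** 2


def tensor_mib(shape, itemsize):
    """Return the size in MiB of a dense tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize / MIB


def elements_for(mib, itemsize):
    """Return how many elements of `itemsize` bytes fit in `mib` MiB."""
    return int(mib * MIB) // itemsize


# The failed request reported in the error log was 240.00 MiB.
# The itemsize values are assumptions; the real dtype is not stated.
for name, itemsize in (("float32", 4), ("float64", 8)):
    print(name, elements_for(240, itemsize))
```

If the failed 240.00 MiB request is the `bmm` output, it would hold roughly 63 million float32 or 31 million float64 elements for a single 216-atom frame, which suggests the edge dimension `E` dominates the cost rather than the number of frames per batch.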

### DeePMD-kit Version

DeePMD-kit v0.1.dev1+g27a18b604

### Backend and its version

PyTorch backend: torch 2.11.0+cu126

### How did you download the software?

Built from source

### Input Files, Running Commands, Error Log, etc.


Model:
DPA4-Plus pretrained model, specifically `DPA4-Plus-OMat24-16M.pt`.

Minimal failing input:
A single DPData system, `sampled_dpdata/216`, with 216 atoms and 59 frames.

The DPData directory contains:

```text
sampled_dpdata/216/type.raw
sampled_dpdata/216/type_map.raw
sampled_dpdata/216/set.000/box.npy
sampled_dpdata/216/set.000/coord.npy
sampled_dpdata/216/set.000/energy.npy
sampled_dpdata/216/set.000/force.npy
sampled_dpdata/216/set.000/real_atom_types.npy
sampled_dpdata/216/set.000/spin.npy
sampled_dpdata/216/set.000/virial.npy
```

Failing command:

```bash
dp --pt eval-desc \
  -s sampled_dpdata/216 \
  -m DPA4-Plus-OMat24-16M.pt \
  -o desc_216
```

The original full-run command was:

```bash
dp --pt eval-desc \
  -s sampled_dpdata \
  -m model.ckpt.pt \
  -o desc_train
```

Successful inference comparison:

```bash
dp --pt test \
  -s other_dpdata \
  -m model.ckpt.pt \
  -d results
```

In the same workflow, four `dp --pt test` Slurm jobs completed successfully, while `dp --pt eval-desc` failed on the training descriptor job. The failing descriptor job first reached `sampled_dpdata/216`:

```text
[2026-06-05 03:10:23,331] DEEPMD INFO    # processing system : .../sampled_dpdata/216
[2026-06-05 03:10:23,349] DEEPMD INFO    # evaluating descriptors for 59 frames
```

Relevant traceback:

```text
File ".../site-packages/deepmd/pt/model/descriptor/sezm_nn/so2.py", line 1313, in forward
    x_local = torch.bmm(D_m_prime, x_src)  # (E, D_m, C_wide)

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 240.00 MiB.
GPU 0 has a total capacity of 31.73 GiB of which 226.69 MiB is free.
Including non-PyTorch memory, this process has 31.51 GiB memory in use.
Of the allocated memory 30.94 GiB is allocated by PyTorch, and 162.59 MiB is reserved by PyTorch but unallocated.

deepmd.utils.errors.OutOfMemoryError:
The callable still throws an out-of-memory (OOM) error even when batch size is 1!
```

Environment variables used in the Slurm job:

```bash
export CUDA_HOME=/opt/devtools/nvidia/cuda-12.6.3
export CUDAToolkit_ROOT=${CUDA_HOME}
export TORCH_CUDA_ARCH_LIST=7.0
export CMAKE_CUDA_ARCHITECTURES=70
export DP_VARIANT=cuda
export DP_ENABLE_PYTORCH=1
export DP_ENABLE_TENSORFLOW=0
export DP_COMPILE_INFER=0
export TORCHDYNAMO_DISABLE=1
export TORCHINDUCTOR_COMPILE_THREADS=1
export DP_INTERFACE_PREC=high
export OMP_NUM_THREADS=2
```

Additional observation:
A CPU fallback attempt for the same `sampled_dpdata/216` system was also killed by Slurm due to memory usage, so the issue does not appear to be only CUDA memory fragmentation.
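The allocator figures in the CUDA error message can be read off programmatically to see how close the process was to the device limit. The following is a minimal sketch, not part of DeePMD-kit or PyTorch: `parse_oom` and `to_bytes` are hypothetical helpers, and the regular expressions assume the message wording quoted in this report.

```python
import re

UNITS = {"KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}

# Message text copied from the traceback in this report.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 240.00 MiB. "
    "GPU 0 has a total capacity of 31.73 GiB of which 226.69 MiB is free. "
    "Including non-PyTorch memory, this process has 31.51 GiB memory in use. "
    "Of the allocated memory 30.94 GiB is allocated by PyTorch, "
    "and 162.59 MiB is reserved by PyTorch but unallocated."
)

PATTERNS = {
    "requested": r"Tried to allocate ([\d.]+) (\w+)",
    "total": r"total capacity of ([\d.]+) (\w+)",
    "free": r"of which ([\d.]+) (\w+) is free",
    "allocated": r"([\d.]+) (\w+) is allocated by PyTorch",
    "reserved_unallocated": r"([\d.]+) (\w+) is reserved by PyTorch",
}


def to_bytes(value, unit):
    """Convert a number and a binary unit such as 'MiB' to bytes."""
    return float(value) * UNITS[unit]


def parse_oom(message):
    """Extract the sizes named in PATTERNS from an OOM message, in bytes."""
    sizes = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, message)
        if match:
            sizes[key] = to_bytes(*match.groups())
    return sizes


info = parse_oom(MESSAGE)
shortfall_mib = (info["requested"] - info["free"]) / UNITS["MiB"]
print(f"request exceeds free memory by {shortfall_mib:.2f} MiB")
```

Read this way, the request only narrowly exceeds the free memory, but almost the entire device is already held as live PyTorch allocations and little is reserved-but-unallocated. That is consistent with the observation above that fragmentation alone is unlikely to explain the failure.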

### Steps to Reproduce

1. Prepare a DeePMD-kit environment with the PyTorch backend and CUDA support.

2. Use the DPA4-Plus pretrained model:
     `DPA4-Plus-OMat24-16M.pt`.

3. Prepare the attached minimal DPData system:
     `sampled_dpdata/216`, containing 216 atoms and 59 frames.

4. Run descriptor extraction:

   ```bash
   dp --pt eval-desc \
     -s sampled_dpdata/216 \
     -m DPA4-Plus-OMat24-16M.pt \
     -o desc_216
   ```

5. Observe that `dp --pt eval-desc` fails with CUDA OOM inside the SeZM/SO2 descriptor block, even after auto batch size is reduced to 1.

6. For comparison, run normal inference with the same model/checkpoint:

   ```bash
   dp --pt test \
     -s <a compatible DPData test set> \
     -m DPA4-Plus-OMat24-16M.pt \
     -d results
   ```

Expected behavior:
  `dp --pt eval-desc` should either complete, expose a way to chunk descriptor extraction further, or clearly document the limitation for DPA4-Plus descriptor extraction on larger systems.

Actual behavior:
  `dp --pt eval-desc` fails with OOM at batch size 1.
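To make the "further chunking" request concrete: since batch size 1 already fails, any extra chunking would have to happen inside a single frame, for example over the edge dimension `E`. The sketch below shows that idea in pure Python on nested lists. It assumes the per-edge products are afterwards summed onto destination atoms, which this report does not confirm for the SeZM/SO2 layer, and every name in it (`edge_messages_chunked`, `rot`, `x_src`, `dst`) is hypothetical. A real implementation would operate on tensors rather than lists.

```python
def matmul(a, b):
    """Multiply two dense matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_into(acc, mat):
    """Accumulate `mat` into `acc` element-wise, in place."""
    for acc_row, row in zip(acc, mat):
        for j, value in enumerate(row):
            acc_row[j] += value


def edge_messages_chunked(rot, x_src, dst, n_atoms, chunk_size):
    """Compute out[dst[e]] += rot[e] @ x_src[e], one chunk of edges at a time.

    Only `chunk_size` per-edge products are alive at once, so the full
    per-edge intermediate is never held in memory. Assumes at least one
    edge and chunk_size >= 1.
    """
    n_rows = len(rot[0])
    n_cols = len(x_src[0][0])
    out = [[[0.0] * n_cols for _ in range(n_rows)] for _ in range(n_atoms)]
    for start in range(0, len(dst), chunk_size):
        stop = min(start + chunk_size, len(dst))
        # Per-chunk intermediate: at most `chunk_size` small matrices.
        local = [matmul(rot[e], x_src[e]) for e in range(start, stop)]
        for e, message in zip(range(start, stop), local):
            add_into(out[dst[e]], message)
    return out


# Toy data: three edges pointing at two atoms.
rot = [[[1, 0], [0, 1]], [[0, 1], [1, 0]], [[2, 0], [0, 2]]]
x_src = [[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]]]
dst = [0, 1, 0]

all_at_once = edge_messages_chunked(rot, x_src, dst, n_atoms=2, chunk_size=3)
one_by_one = edge_messages_chunked(rot, x_src, dst, n_atoms=2, chunk_size=1)
```

Both calls produce the same per-atom result; only the peak size of the intermediate differs. Whether such a restructuring is possible in the actual descriptor depends on how the `bmm` output is consumed downstream.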

### Further Information, Files, and Links

[issue_dpa4_plus_eval_desc_oom.tar.gz](https://github.com/user-attachments/files/28737322/issue_dpa4_plus_eval_desc_oom.tar.gz)
