Motivation
GPU-native MD engines like GPUMD keep all simulation data (positions, forces, velocities) on GPU throughout the MD loop. When calling DeePMD-kit through the current C++ API, the data flow becomes:
    GPU positions → [D2H] → std::vector<double> → DeepPot::compute()
      → internally: copy_coord + build_nlist (CPU) → tensor H2D → model forward (GPU) → D2H
      → std::vector<double> forces → [H2D] → GPU forces
This results in three unnecessary host round-trips per step: positions D2H, internal nlist construction on CPU, and forces H2D. Profiling on an A800 (DPA4 2l16, .pt2) confirms that deepmd_compute occupies >99.5% of per-step wall time in GPUMD's dp.cu bridge. The GPUMD/LAMMPS throughput ratio (values below 1x mean GPUMD is slower) is:
| System Size | GPUMD/LAMMPS (standard) | GPUMD/LAMMPS (triton) |
| --- | --- | --- |
| 250 atoms | 0.36x | 0.30x |
| 1k atoms | 0.80x | 0.59x |
| 10k atoms | 0.96x | 0.86x |
LAMMPS is faster primarily because it calls the with-nlist overload (compute(..., nghost, inlist, ago, ...)) and benefits from:
- Passing a pre-built neighbor list: DeePMD skips copy_coord + build_nlist
- ago-based caching: firstneigh_tensor is reused on GPU across non-rebuild steps (~90% of steps with skin distance)
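The caching behaviour described above can be sketched in a few lines. This is a minimal illustration, not DeePMD-kit's implementation: `NeighborCache` and its members are hypothetical names, a host-side `std::vector` stands in for the cached device tensor, and the only convention taken from the with-nlist interface is that `ago == 0` marks a step on which the caller rebuilt its list.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the cached neighbor table. In DeePMD-kit the
// cached object is a device tensor, which this sketch does not model.
struct NeighborCache {
  std::vector<std::vector<int>> firstneigh;  // neighbors of each local atom
  std::size_t rebuild_count = 0;             // how often the cache was refreshed

  // ago == 0: the caller rebuilt its list this step, so refresh the cache.
  // ago > 0: the list is unchanged since the last rebuild, so reuse the cache.
  const std::vector<std::vector<int>>& get(
      const std::vector<std::vector<int>>& caller_list, int ago) {
    assert(ago >= 0);
    if (ago == 0 || firstneigh.empty()) {
      firstneigh = caller_list;  // the only step that pays for a copy
      ++rebuild_count;
    }
    return firstneigh;
  }
};
```

With a rebuild every 10 steps, calling `get(list, step % 10)` for 20 steps refreshes the cache twice and reuses it on the other 18 steps, which is the regime the ~90% figure refers to.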
GPUMD cannot use this path because GPUMD's internal neighbor list does not produce the correct extended ghost topology required by message-passing models (DPA2/3/4). Therefore GPUMD must use the no-nlist overload, which forces DeePMD to rebuild everything on CPU every step.
Proposal
Add a public device-resident C++ API that accepts torch::Tensor on the caller's device:
    class DeepPot {
     public:
      /// Device-resident compute: all tensors live on the same CUDA device.
      /// DeePMD handles ghost construction and neighbor list building on GPU
      /// (similar to VesinNeighborList in Python pt_expt).
      void compute_device(
          torch::Tensor& energy,       // [1] or [nheads], on device
          torch::Tensor& force,        // [natoms, 3], on device
          torch::Tensor& virial,       // [9], on device
          const torch::Tensor& coord,  // [natoms, 3], on device
          const torch::Tensor& atype,  // [natoms], on device (int32/int64)
          const torch::Tensor& box,    // [3, 3], on device
          const torch::Tensor& fparam = {},
          const torch::Tensor& aparam = {});
    };
In this path, DeePMD would:
- Accept device-resident coord/atype/box directly (no D2H)
- Build the neighbor list on GPU (the VesinNeighborList in deepmd/pt_expt/utils/vesin_neighbor_list.py already does this via vesin.torch)
- Run model forward on GPU
- Write forces directly to a device tensor (no H2D)
This eliminates all host round-trips. The internal DeepPotPTExpt::run_model() already operates on torch::Tensor — the main work is exposing it through the public API with proper ghost/nlist handling on GPU.
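The input contract implied by the compute_device signature (shapes plus a single shared device) could be checked up front, before any kernel is launched. The sketch below is an assumption about how such a check might look, not part of DeePMD-kit: `TensorMeta` is a hypothetical stand-in for the shape and device index of a `torch::Tensor`, and `check_device_inputs` is an illustrative name, so the example compiles without libtorch.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the metadata of a torch::Tensor.
struct TensorMeta {
  std::vector<std::int64_t> shape;
  int device;  // CUDA device index; a negative value denotes host memory here
};

// Returns natoms when the inputs match the proposed contract,
// throws std::invalid_argument otherwise.
std::int64_t check_device_inputs(const TensorMeta& coord,
                                 const TensorMeta& atype,
                                 const TensorMeta& box) {
  if (coord.shape.size() != 2 || coord.shape[1] != 3) {
    throw std::invalid_argument("coord must have shape [natoms, 3]");
  }
  const std::int64_t natoms = coord.shape[0];
  if (atype.shape.size() != 1 || atype.shape[0] != natoms) {
    throw std::invalid_argument("atype must have shape [natoms]");
  }
  if (box.shape != std::vector<std::int64_t>{3, 3}) {
    throw std::invalid_argument("box must have shape [3, 3]");
  }
  if (coord.device < 0) {
    throw std::invalid_argument("coord must live on a CUDA device");
  }
  if (atype.device != coord.device || box.device != coord.device) {
    throw std::invalid_argument("all inputs must live on the same device");
  }
  return natoms;
}
```

Failing early on a device mismatch matters for this path in particular, because a silent fallback that copies a tensor to the right device would reintroduce the transfers the API is meant to remove.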
Why not just pass a neighbor list from GPUMD?
For message-passing models (DPA2/3/4), the neighbor list must include multi-layer ghost atoms with correct ghost-ghost connectivity. GPUMD's own neighbor list structure is incompatible with this requirement. Having DeePMD build the correct topology internally (as it already does in the no-nlist path) but on GPU would solve both correctness and performance.
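To make "ghost atoms with ghost-ghost connectivity" concrete, here is a minimal CPU sketch of the topology involved. It is not DeePMD-kit's algorithm and it is not GPU code: it assumes an orthorhombic box, coordinates wrapped into the box, a ghost shell no thicker than the box, and a brute-force O(N²) pair search. `ExtendedSystem`, `extend_with_ghosts`, and `build_nlist` are hypothetical names. The shell thickness is left as a parameter, since how thick it must be for a given multi-layer model is not specified here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Hypothetical container for the extended (local + ghost) system.
struct ExtendedSystem {
  std::vector<Vec3> coord;   // local atoms first, then ghost images
  std::vector<int> mapping;  // extended index -> local atom it is a copy of
  std::size_t nlocal = 0;
};

// Adds every periodic image lying within `shell` of an orthorhombic box with
// edge lengths `box`. Assumes local coordinates are wrapped into [0, box).
ExtendedSystem extend_with_ghosts(const std::vector<Vec3>& local,
                                  const Vec3& box, double shell) {
  for (std::size_t d = 0; d < 3; ++d) {
    assert(shell <= box[d]);  // one image layer per direction is then enough
  }
  ExtendedSystem ext;
  ext.nlocal = local.size();
  for (std::size_t a = 0; a < local.size(); ++a) {
    ext.coord.push_back(local[a]);
    ext.mapping.push_back(static_cast<int>(a));
  }
  for (std::size_t a = 0; a < local.size(); ++a) {
    for (int i = -1; i <= 1; ++i)
      for (int j = -1; j <= 1; ++j)
        for (int k = -1; k <= 1; ++k) {
          if (i == 0 && j == 0 && k == 0) continue;  // the local atom itself
          const Vec3 img = {local[a][0] + i * box[0],
                            local[a][1] + j * box[1],
                            local[a][2] + k * box[2]};
          bool inside = true;
          for (std::size_t d = 0; d < 3; ++d) {
            inside = inside && img[d] >= -shell && img[d] < box[d] + shell;
          }
          if (inside) {
            ext.coord.push_back(img);
            ext.mapping.push_back(static_cast<int>(a));
          }
        }
  }
  return ext;
}

// Brute-force neighbor list over ALL extended atoms, so ghost rows are filled
// too; ghost-ghost pairs appear whenever two images are within rcut.
std::vector<std::vector<int>> build_nlist(const ExtendedSystem& ext,
                                          double rcut) {
  const std::size_t n = ext.coord.size();
  std::vector<std::vector<int>> nlist(n);
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = 0; j < n; ++j) {
      if (i == j) continue;
      double r2 = 0.0;
      for (std::size_t d = 0; d < 3; ++d) {
        const double dx = ext.coord[i][d] - ext.coord[j][d];
        r2 += dx * dx;
      }
      if (r2 < rcut * rcut) nlist[i].push_back(static_cast<int>(j));
    }
  return nlist;
}
```

The `mapping` array is what lets a message-passing model tie each ghost back to the local atom it copies. A list that only stores neighbors for local atoms would leave the ghost rows empty, which is the kind of mismatch the paragraph above describes.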
Scope
- In scope: DeepPotPTExpt (.pt2 AOTInductor) backend, where the full pipeline can stay on GPU.
- Nice to have: DeepPotPT (JIT .pth) backend support.
- Out of scope: TF backend.
Existing internal infrastructure
- DeepPotPTExpt::run_model() already accepts torch::Tensor (coord, atype, nlist, mapping); it is currently private.
- VesinNeighborList in deepmd/pt_expt/utils/vesin_neighbor_list.py builds neighbor lists on GPU via vesin.torch.
- build_nlist_gpu() in source/lib/src/gpu/neighbor_list.cu builds neighbor lists on GPU for the TF op layer.
- convert_nlist_gpu_device() converts host InputNlist to device memory.
These pieces exist but are not wired together into a public C++ API.
Impact
This would benefit any GPU-native MD engine integrating DeePMD-kit (GPUMD, and potentially future engines). For GPUMD specifically, it would close the remaining throughput gap vs LAMMPS for small-to-medium systems.