[Feature Request] Device-resident C++ API (compute_device) to eliminate host round-trips for GPU-native MD engines #5574

Description

@SchrodingersCattt

Motivation

GPU-native MD engines like GPUMD keep all simulation data (positions, forces, velocities) on GPU throughout the MD loop. When calling DeePMD-kit through the current C++ API, the data flow becomes:

GPU positions → [D2H] → std::vector<double> → DeepPot::compute()
  → internally: copy_coord + build_nlist (CPU) → tensor H2D → model forward (GPU) → D2H
    → std::vector<double> forces → [H2D] → GPU forces

This adds three unnecessary trips through host memory per step: the D2H copy of positions, neighbor-list construction on the CPU, and the H2D copy of forces. Profiling on an A800 (DPA4 2l16, .pt2) confirms that deepmd_compute accounts for >99.5% of per-step wall time in GPUMD's dp.cu bridge. The resulting GPUMD/LAMMPS throughput ratio (values below 1 mean GPUMD is slower) is:

| System size | GPUMD/LAMMPS (standard) | GPUMD/LAMMPS (triton) |
|-------------|-------------------------|-----------------------|
| 250 atoms   | 0.36x                   | 0.30x                 |
| 1k atoms    | 0.80x                   | 0.59x                 |
| 10k atoms   | 0.96x                   | 0.86x                 |

LAMMPS is faster primarily because it calls the with-nlist overload (compute(..., nghost, inlist, ago, ...)) and benefits from:

  1. Passing a pre-built neighbor list — DeePMD skips copy_coord + build_nlist
  2. ago-based caching — firstneigh_tensor is reused on GPU across non-rebuild steps (~90% of steps with skin distance)
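The ago-based caching in point 2 can be illustrated with a minimal sketch. It assumes the usual convention that ago == 0 means the caller has just rebuilt its neighbor list. NeighborCache and count_uploads are hypothetical names introduced only for this illustration, and a host vector stands in for the device-resident firstneigh_tensor so the example runs without a GPU; it is not DeePMD-kit's actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device-resident neighbor tensor. In a real
// backend this would live in GPU memory; a host vector keeps the sketch
// runnable anywhere.
struct NeighborCache {
  std::vector<int> firstneigh;  // flattened neighbor indices
  std::size_t uploads = 0;      // counts simulated host-to-device copies

  // `ago` is the number of steps since the caller last rebuilt its list.
  // ago == 0 (or an empty cache) forces a refresh; otherwise the cached
  // copy is reused and no transfer happens.
  const std::vector<int>& get(int ago, const std::vector<int>& host_nlist) {
    if (ago == 0 || uploads == 0) {
      firstneigh = host_nlist;  // stands in for the H2D transfer
      ++uploads;
    }
    return firstneigh;
  }
};

// Simulates `nsteps` MD steps in which the caller rebuilds its list every
// `rebuild_every` steps, and returns how many uploads were performed.
std::size_t count_uploads(int nsteps, int rebuild_every) {
  NeighborCache cache;
  const std::vector<int> host_nlist = {1, 2, 0, 2, 0, 1};
  for (int step = 0; step < nsteps; ++step) {
    const int ago = step % rebuild_every;
    cache.get(ago, host_nlist);
  }
  return cache.uploads;
}
```

With a rebuild every 10 steps, count_uploads(100, 10) performs 10 uploads, so 90 of the 100 steps reuse the cached list.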

GPUMD cannot use this path because GPUMD's internal neighbor list does not produce the correct extended ghost topology required by message-passing models (DPA2/3/4). Therefore GPUMD must use the no-nlist overload, which forces DeePMD to rebuild everything on CPU every step.

Proposal

Add a public device-resident C++ API that accepts torch::Tensor on the caller's device:

class DeepPot {
public:
  /// Device-resident compute: all tensors live on the same CUDA device.
  /// DeePMD handles ghost construction and neighbor list building on GPU
  /// (similar to VesinNeighborList in Python pt_expt).
  void compute_device(
      torch::Tensor& energy,        // [1] or [nheads], on device
      torch::Tensor& force,         // [natoms, 3], on device
      torch::Tensor& virial,        // [9], on device
      const torch::Tensor& coord,   // [natoms, 3], on device
      const torch::Tensor& atype,   // [natoms], on device (int32/int64)
      const torch::Tensor& box,     // [3, 3], on device
      const torch::Tensor& fparam = {},
      const torch::Tensor& aparam = {});
};

In this path, DeePMD would:

  1. Accept device-resident coord/atype/box directly (no D2H)
  2. Build the neighbor list on GPU (the VesinNeighborList in deepmd/pt_expt/utils/vesin_neighbor_list.py already does this via vesin.torch)
  3. Run model forward on GPU
  4. Write forces directly to a device tensor (no H2D)

This eliminates all host round-trips. The internal DeepPotPTExpt::run_model() already operates on torch::Tensor — the main work is exposing it through the public API with proper ghost/nlist handling on GPU.
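To make step 2 concrete, here is a minimal host-side sketch of what a neighbor-list build computes: for each local atom, the indices of all other atoms (local or ghost) within the cutoff. The function build_nlist and its signature are assumptions made for illustration; they are not the vesin.torch or DeePMD-kit API. A GPU version would typically assign one thread per local atom and use cell lists rather than this O(N²) loop.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Brute-force neighbor list over an extended coordinate array.
// The first `nloc` entries of `ext_coord` are local atoms; any remaining
// entries are ghost images. Returns, for each local atom, the extended
// indices of all other atoms closer than `rcut`.
std::vector<std::vector<int>> build_nlist(const std::vector<Vec3>& ext_coord,
                                          std::size_t nloc, double rcut) {
  std::vector<std::vector<int>> nlist(nloc);
  const double rc2 = rcut * rcut;
  for (std::size_t i = 0; i < nloc; ++i) {
    for (std::size_t j = 0; j < ext_coord.size(); ++j) {
      if (i == j) continue;  // an atom is not its own neighbor
      double d2 = 0.0;
      for (int k = 0; k < 3; ++k) {
        const double d = ext_coord[j][k] - ext_coord[i][k];
        d2 += d * d;
      }
      if (d2 < rc2) nlist[i].push_back(static_cast<int>(j));
    }
  }
  return nlist;
}
```

Because the loop over i is independent per atom, the same computation maps naturally onto device threads, which is what keeps coord, the neighbor list, and forces on the GPU in the proposed path.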

Why not just pass a neighbor list from GPUMD?

For message-passing models (DPA2/3/4), the neighbor list must include multi-layer ghost atoms with correct ghost-ghost connectivity. GPUMD's own neighbor list structure is incompatible with this requirement. Having DeePMD build the correct topology internally (as it already does in the no-nlist path) but on GPU would solve both correctness and performance.
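As a rough illustration of ghost construction, the sketch below extends a set of coordinates with periodic images and records a mapping from each extended index back to its source local atom (the same kind of mapping that run_model() takes as an argument). It is deliberately simplified: it assumes an orthorhombic box, coordinates already wrapped into [0, box), a cutoff no larger than the shortest box edge, and a single ghost shell, so it does not model the multi-layer requirement described above. Extended and extend_with_ghosts are hypothetical names, not DeePMD-kit functions.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

struct Extended {
  std::vector<Vec3> coord;   // local atoms first, then ghost images
  std::vector<int> mapping;  // extended index -> index of the source local atom
};

// Assumptions (for this sketch only): orthorhombic box with edge lengths
// `box`, coordinates wrapped into [0, box), and rcut <= min(box), so only
// the 26 nearest periodic images need to be considered.
Extended extend_with_ghosts(const std::vector<Vec3>& coord, const Vec3& box,
                            double rcut) {
  Extended ext;
  const int n = static_cast<int>(coord.size());
  for (int i = 0; i < n; ++i) {
    ext.coord.push_back(coord[i]);
    ext.mapping.push_back(i);
  }
  for (int sx = -1; sx <= 1; ++sx) {
    for (int sy = -1; sy <= 1; ++sy) {
      for (int sz = -1; sz <= 1; ++sz) {
        if (sx == 0 && sy == 0 && sz == 0) continue;  // skip the primary cell
        for (int i = 0; i < n; ++i) {
          const Vec3 p = {coord[i][0] + sx * box[0],
                          coord[i][1] + sy * box[1],
                          coord[i][2] + sz * box[2]};
          // Keep the image only if it lies within rcut of the primary cell
          // along every axis (a conservative slab test).
          bool keep = true;
          for (int k = 0; k < 3; ++k) {
            if (p[k] < -rcut || p[k] > box[k] + rcut) keep = false;
          }
          if (keep) {
            ext.coord.push_back(p);
            ext.mapping.push_back(i);
          }
        }
      }
    }
  }
  return ext;
}
```

The mapping is what lets a message-passing model treat a ghost as a copy of its source atom; the open design question in this issue is producing that extended topology on the GPU rather than on the host.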

Scope

  • In scope: DeepPotPTExpt (.pt2 AOTInductor) backend, where the full pipeline can stay on GPU.
  • Nice to have: DeepPotPT (JIT .pth) backend support.
  • Out of scope: TF backend.

Existing internal infrastructure

  • DeepPotPTExpt::run_model() already accepts torch::Tensor (coord, atype, nlist, mapping) — it's currently private.
  • VesinNeighborList in deepmd/pt_expt/utils/vesin_neighbor_list.py builds neighbor lists on GPU via vesin.torch.
  • build_nlist_gpu() in source/lib/src/gpu/neighbor_list.cu builds neighbor lists on GPU for the TF op layer.
  • convert_nlist_gpu_device() converts host InputNlist to device memory.

These pieces exist but are not wired together into a public C++ API.

Impact

This would benefit any GPU-native MD engine integrating DeePMD-kit (GPUMD, and potentially future engines). For GPUMD specifically, it would close the remaining throughput gap vs LAMMPS for small-to-medium systems.
