[Feature Request] Device-resident C++ API (compute_device) to eliminate host round-trips for GPU-native MD engines

### Motivation

GPU-native MD engines like [GPUMD](https://github.com/brucefan1983/GPUMD) keep all simulation data (positions, forces, velocities) on GPU throughout the MD loop. When calling DeePMD-kit through the current C++ API, the data flow becomes:

```
GPU positions → [D2H] → std::vector<double> → DeepPot::compute()
  → internally: copy_coord + build_nlist (CPU) → tensor H2D → model forward (GPU) → D2H
    → std::vector<double> forces → [H2D] → GPU forces
```

This results in **three unnecessary host round-trips per step**: the position D2H copy, neighbor-list construction on the CPU inside DeePMD, and the force H2D copy. Profiling on an A800 (DPA4 2l16, `.pt2`) confirms that `deepmd_compute` occupies >99.5% of per-step wall time in GPUMD's `dp.cu` bridge. The resulting GPUMD throughput relative to LAMMPS (values below 1x mean GPUMD is slower) is:

| System Size | GPUMD/LAMMPS (standard) | GPUMD/LAMMPS (triton) |
|---|---|---|
| 250 atoms | 0.36x | 0.30x |
| 1k atoms | 0.80x | 0.59x |
| 10k atoms | 0.96x | 0.86x |

LAMMPS is faster primarily because it calls the with-nlist overload (`compute(..., nghost, inlist, ago, ...)`) and benefits from:
1. Passing a pre-built neighbor list — DeePMD skips `copy_coord` + `build_nlist`
2. `ago`-based caching — `firstneigh_tensor` is reused on GPU across non-rebuild steps (~90% of steps with skin distance)

GPUMD cannot use this path because GPUMD's internal neighbor list does not produce the correct extended ghost topology required by message-passing models (DPA2/3/4). Therefore GPUMD must use the no-nlist overload, which forces DeePMD to rebuild everything on CPU every step.

### Proposal

Add a public device-resident C++ API that accepts `torch::Tensor` on the caller's device:

```cpp
class DeepPot {
public:
  /// Device-resident compute: all tensors live on the same CUDA device.
  /// DeePMD handles ghost construction and neighbor list building on GPU
  /// (similar to VesinNeighborList in Python pt_expt).
  void compute_device(
      torch::Tensor& energy,        // [1] or [nheads], on device
      torch::Tensor& force,         // [natoms, 3], on device
      torch::Tensor& virial,        // [9], on device
      const torch::Tensor& coord,   // [natoms, 3], on device
      const torch::Tensor& atype,   // [natoms], on device (int32/int64)
      const torch::Tensor& box,     // [3, 3], on device
      const torch::Tensor& fparam = {},
      const torch::Tensor& aparam = {});
};
```

In this path, DeePMD would:
1. Accept device-resident coord/atype/box directly (no D2H)
2. Build the neighbor list **on GPU** (the `VesinNeighborList` in `deepmd/pt_expt/utils/vesin_neighbor_list.py` already does this via `vesin.torch`)
3. Run model forward on GPU
4. Write forces directly to a device tensor (no H2D)

This eliminates all host round-trips. The internal `DeepPotPTExpt::run_model()` already operates on `torch::Tensor` — the main work is exposing it through the public API with proper ghost/nlist handling on GPU.

### Why not just pass a neighbor list from GPUMD?

For message-passing models (DPA2/3/4), the neighbor list must include multi-layer ghost atoms with correct ghost-ghost connectivity. GPUMD's own neighbor list structure is incompatible with this requirement. Having DeePMD build the correct topology internally (as it already does in the no-nlist path) but **on GPU** would solve both correctness and performance.

### Scope

- **In scope**: `DeepPotPTExpt` (`.pt2` AOTInductor) backend, where the full pipeline can stay on GPU.
- **Nice to have**: `DeepPotPT` (JIT `.pth`) backend support.
- **Out of scope**: TF backend.

### Existing internal infrastructure

- `DeepPotPTExpt::run_model()` already accepts `torch::Tensor` (coord, atype, nlist, mapping) — it's currently `private`.
- `VesinNeighborList` in `deepmd/pt_expt/utils/vesin_neighbor_list.py` builds neighbor lists **on GPU** via `vesin.torch`.
- `build_nlist_gpu()` in `source/lib/src/gpu/neighbor_list.cu` builds neighbor lists on GPU for the TF op layer.
- `convert_nlist_gpu_device()` converts host `InputNlist` to device memory.

These pieces exist but are not wired together into a public C++ API.

### Impact

This would benefit any GPU-native MD engine integrating DeePMD-kit (GPUMD, and potentially future engines). For GPUMD specifically, it would close the remaining throughput gap vs LAMMPS for small-to-medium systems.

