[Q&A] DP with i-PI best practices #5427
Unanswered
JelleLagerweij asked this question in Q&A
Replies: 1 comment
This topic is somewhat related to the ipi/examples/hpc_scripts/slurm_gpu folder. Please see https://github.com/i-pi/i-pi/tree/main/examples/hpc_scripts/slurm_gpu.
Question
In short:
The DeePMD-kit documentation only shows how to run i-PI with a single dp_ipi client, but PIMD/RPMD simulations typically need many more beads than available GPUs. For large models, running one dp_ipi client per GPU works well. For small/fast models on modern hardware (e.g. DPA1 on H100s), GPU utilization per client can be below 50%, so running multiple dp_ipi clients per GPU, or batching bead evaluations within a single client, might help; this is, however, undocumented. Additionally, dp_ipi uses an O(N²) native neighbor list, making LAMMPS (with its O(N) bin-based neighbor list) potentially a better i-PI client for larger systems. I want to improve the manual, but before I do, I would like some input from the community:

- Does anyone have experience with using underutilized GPUs by oversubscribing them with multiple clients?
- Has anyone already found a guideline for when to switch from dp_ipi to the LAMMPS i-PI interface?
- Are there any other i-PI "best practices" relevant to include in the manual?

The long version:
I am using DeePMD-kit with the prepared dp_ipi force engine. Overall, I am quite happy with the performance of my simulations using classical nuclei (both in accuracy and in simulation speed). However, I find the instructions in the manual incomplete: they show how to run i-PI with a host and one dp_ipi client, but the real use case of i-PI is of course PIMD or RPMD simulations with multiple beads. Many PIMD or RPMD simulations use 16 or more beads, and I believe most users prefer not to use more than one GPU node (typically 4 or 8 GPUs per node on HPC systems). So for i-PI in particular there is a common case where the simulation has more beads than available GPUs. The only place in the manual that touches on this is the sentence "It is noted that multiple instances of the client allow for computing, in parallel, the interactions of multiple replicas of the path-integral MD." (documentation section 10.4).
For the larger models (DPA2 or DPA3 with more layers and many atom types), there is a very simple (and logical) solution: i-PI is perfectly fine with using more beads than clients. I would expect a setup along the lines of the sketch below.
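For concreteness, here is a minimal sketch of such a launch script. The 32-bead/4-GPU split, the file names (input.xml, water.json), and the log names are assumptions for the example; the only requirement is that the <ffsocket> in input.xml matches the address and port configured in water.json.

```bash
#!/bin/bash
# Sketch: input.xml requests 32 beads; we start 4 dp_ipi clients,
# one per GPU. i-PI then distributes the 32 bead evaluations over
# the 4 clients in rounds of 4.

# Start the i-PI server in the background.
i-pi input.xml > ipi.log 2>&1 &
sleep 10  # give the server time to open its socket

# One dp_ipi client per GPU.
for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$gpu dp_ipi water.json > client_${gpu}.log 2>&1 &
done

wait
```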
Now i-PI will stream beads 0 through 3 to dp_ipi clients 0 through 3, wait for the results, then stream beads 4 through 7 to the same clients, and so on. With some leftover overhead, this is about 8 times slower than 1 bead with 1 dp_ipi client. This is very efficient, as 1 dp_ipi client with an expensive MLIP fully loads most modern GPUs.
However, many of us still run smaller MLIP models on HPCs with very modern hardware (think of DPA1 with only a few atom types on nodes with H100 GPUs). This makes sense, as it gives acceptable simulation speeds with larger numbers of atoms, while the highly complex MLIPs are still (for now) too slow to run tens of ns with larger (or more) molecules. From my testing, a medium or small DPA2 or DPA1 model can often stay under 50% GPU utilization in all relevant categories (% memory usage, % memory bandwidth, % compute) with a single bead on a single H100 GPU. Ideally, 1 dp_ipi client could handle multiple bead requests at once when it recognizes underutilization of the GPU (or via an explicit additional option in the water.json file, something like "batch_evaluation": 2 to make every dp_ipi client automatically handle 2 beads at once). However, I could not find anything like this in the manual. Therefore, I thought that something like the sketch below might perform better. Obviously, using 8 dp_ipi clients per GPU would come with significant performance penalties and overheads, but 2 dp_ipi clients per GPU might be useful. I plan to do some benchmarking for this soon.
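As a starting point for that benchmarking, an oversubscribed launch might look like the following sketch. The value of 2 clients per GPU is an assumption to be tuned, not a recommendation, and the file names are again placeholders.

```bash
#!/bin/bash
# Sketch: oversubscribe each of the 4 GPUs with 2 dp_ipi clients
# (8 clients total). The two processes on each device simply
# time-share it; whether this pays off depends on how far below
# full utilization a single client sits.

i-pi input.xml > ipi.log 2>&1 &
sleep 10  # give the server time to open its socket

CLIENTS_PER_GPU=2  # assumed value; benchmark 1 vs 2 (vs more) per GPU
for gpu in 0 1 2 3; do
    for c in $(seq 1 $CLIENTS_PER_GPU); do
        CUDA_VISIBLE_DEVICES=$gpu dp_ipi water.json > client_${gpu}_${c}.log 2>&1 &
    done
done

wait
```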
What is missing from the documentation is a discussion of how this is specifically relevant for path-integral MD simulations in i-PI. I can write this, but before I do, I want to collect some ideas from other (perhaps more experienced) people using DP and i-PI.
On neighbor lists and LAMMPS as an alternative client: going through the source code of dp_ipi, I could not find any reference to a special neighbor list or Verlet list. I therefore assume that it uses the same default neighbor list as the one indicated in the DeePMD-kit inference documentation, which states that the native neighbor list is O(N²) in complexity: fine for small systems, but increasingly costly as system size grows. Note that this is in the Python documentation; the C/C++ documentation only describes how to connect other neighbor lists (so I assume the default is the same there). By contrast, using LAMMPS as the i-PI client (with pair_style deepmd) gives you a proper bin-based O(N) neighbor list with skin distance and rebuild logic. Importantly, switching to LAMMPS as the client does not require a major workflow change: the ffsocket setup in input.xml is identical, and the launch script simply replaces dp_ipi water.json with lmp -i in.lammps in the for loop, as in the sketch below. Does anyone have experience comparing dp_ipi versus LAMMPS as an i-PI client for PIMD, particularly regarding neighbor-list overhead at larger system sizes?
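To show how small that change is, here is a sketch of the modified loop. It assumes an in.lammps that loads the model via pair_style deepmd and connects to the server through LAMMPS's fix ipi, with an address/port matching the <ffsocket> in input.xml; input.xml itself is unchanged from the dp_ipi setup, and all file names are placeholders.

```bash
#!/bin/bash
# Sketch: same launch pattern as before, but with LAMMPS as the
# i-PI client instead of dp_ipi. in.lammps must contain a
# "fix ... ipi ..." line pointing at the same socket as the
# <ffsocket> in input.xml, plus "pair_style deepmd" for the forces.

i-pi input.xml > ipi.log 2>&1 &
sleep 10  # give the server time to open its socket

for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$gpu lmp -i in.lammps > lmp_${gpu}.log 2>&1 &
done

wait
```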
DeePMD-kit Version
3.1.3
Backend and its version
No response
Python Version, CUDA Version, GCC Version, LAMMPS Version, etc
No response
Details
No response
Reproducible Example, Input Files, and Commands
No response
Further Information, Files, and Links
No response