[Q&A] DP with i-PI best practices #5427
Unanswered
JelleLagerweij asked this question in Q&A
Replies: 1 comment
This topic is somewhat related to the ipi/examples/hpc_scripts/slurm_gpu folder. Please see https://github.com/i-pi/i-pi/tree/main/examples/hpc_scripts/slurm_gpu.
Question
In short:
The DeePMD-kit documentation only shows how to run i-PI with a single dp_ipi client, but PIMD/RPMD simulations typically need many more beads than available GPUs. For large models, running one dp_ipi client per GPU works well. For small/fast models on modern hardware (e.g. DPA1 on H100s), GPU utilization per client can be below 50%, so running multiple dp_ipi clients per GPU, or batching bead evaluations within a single client, might help; this is, however, undocumented. Additionally, dp_ipi uses an O(N²) native neighbor list, making LAMMPS (with its O(N) bin-based neighbor list) potentially a better i-PI client for larger systems. I want to improve the manual, but before I do, I would like some input from the community:

- Does anyone have experience with using underutilized GPUs by oversubscribing them with multiple clients?
- Has anyone already found a guideline for when to switch from dp_ipi to the LAMMPS i-PI interface?
- Are there any other i-PI "best practices" relevant to include in the manual?

The long version:
I am using DeePMD-kit with the prepared dp_ipi force engine. Overall, I am quite happy with the performance of my simulations using classical nuclei (both in accuracy and in simulation speed). However, I find the instructions in the manual incomplete: they show how to run i-PI with a host and one dp_ipi client, but the real use case of i-PI is of course PIMD or RPMD simulations with multiple beads. Many PIMD or RPMD simulations use 16 or more beads, and I believe most users prefer not to use more than one GPU node (typically 4 or 8 GPUs per node on HPC systems). So for i-PI in particular there is a common case where the simulation has more beads than available GPUs. The only place in the manual that touches on this is the sentence "It is noted that multiple instances of the client allow for computing, in parallel, the interactions of multiple replicas of the path-integral MD." (documentation section 10.4).
For the larger models (DPA2 or DPA3 with more layers and many atom types), there is a very simple (and logical) solution: i-PI is perfectly fine with using more beads than clients. I would expect a setup along the lines of the sketch below.
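For concreteness, here is a minimal sketch of such a launch script. The 32-bead/4-GPU split, the file names (input.xml, water.json), and the log names are assumptions for the example; the only requirement is that the <ffsocket> in input.xml matches the address and port configured in water.json.

```bash
#!/bin/bash
# Sketch: input.xml requests 32 beads; we start 4 dp_ipi clients,
# one per GPU. i-PI then distributes the 32 bead evaluations over
# the 4 clients in rounds of 4.

# Start the i-PI server in the background.
i-pi input.xml > ipi.log 2>&1 &
sleep 10  # give the server time to open its socket

# One dp_ipi client per GPU.
for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$gpu dp_ipi water.json > client_${gpu}.log 2>&1 &
done

wait
```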
Now i-PI will stream beads 0 through 3 to dp_ipi clients 0 through 3, wait for the results, then stream beads 4 through 7 to the same clients, and so on. With some leftover overhead, this is about 8 times slower than 1 bead with 1 dp_ipi client. This is very efficient, as 1 dp_ipi client with an expensive MLIP fully loads most modern GPUs.
However, many of us still run smaller MLIP models on HPCs with very modern hardware (think of DPA1 with only a few atom types on nodes with H100 GPUs). This makes sense, as it gives acceptable simulation speeds with larger numbers of atoms, while the highly complex MLIPs are still (for now) too slow to run tens of ns with larger (or more) molecules. From my testing, a medium or small DPA2 or DPA1 model can often stay under 50% GPU utilization in all relevant categories (% memory usage, % memory bandwidth, % compute) with a single bead on a single H100 GPU. Ideally, 1 dp_ipi client could handle multiple bead requests at once when it recognizes underutilization of the GPU (or via an explicit additional option in the water.json file, something like "batch_evaluation": 2 to make every dp_ipi client automatically handle 2 beads at once). However, I could not find anything like this in the manual. Therefore, I thought that something like the sketch below might perform better. Obviously, using 8 dp_ipi clients per GPU would come with significant performance penalties and overheads, but 2 dp_ipi clients per GPU might be useful. I plan to do some benchmarking for this soon.
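As a starting point for that benchmarking, an oversubscribed launch might look like the following sketch. The value of 2 clients per GPU is an assumption to be tuned, not a recommendation, and the file names are again placeholders.

```bash
#!/bin/bash
# Sketch: oversubscribe each of the 4 GPUs with 2 dp_ipi clients
# (8 clients total). The two processes on each device simply
# time-share it; whether this pays off depends on how far below
# full utilization a single client sits.

i-pi input.xml > ipi.log 2>&1 &
sleep 10  # give the server time to open its socket

CLIENTS_PER_GPU=2  # assumed value; benchmark 1 vs 2 (vs more) per GPU
for gpu in 0 1 2 3; do
    for c in $(seq 1 $CLIENTS_PER_GPU); do
        CUDA_VISIBLE_DEVICES=$gpu dp_ipi water.json > client_${gpu}_${c}.log 2>&1 &
    done
done

wait
```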
What is missing from the documentation is a discussion of how this is specifically relevant for path-integral MD simulations in i-PI. I can write this, but before I do, I want to collect some ideas from other (perhaps more experienced) people using DP and i-PI.
On neighbor lists and LAMMPS as an alternative client: going through the source code of dp_ipi, I could not find any reference to a special neighbor list or Verlet list. I therefore assume that it uses the same default neighbor list as the one indicated in the DeePMD-kit inference documentation, which states that the native neighbor list is O(N²) in complexity: fine for small systems, but increasingly costly as system size grows. Note that this is in the Python documentation; the C/C++ documentation only describes how to connect other neighbor lists (so I assume the default is the same there). By contrast, using LAMMPS as the i-PI client (with pair_style deepmd) gives you a proper bin-based O(N) neighbor list with skin distance and rebuild logic. Importantly, switching to LAMMPS as the client does not require a major workflow change: the ffsocket setup in input.xml is identical, and the launch script simply replaces dp_ipi water.json with lmp -i in.lammps in the for loop, as in the sketch below. Does anyone have experience comparing dp_ipi versus LAMMPS as an i-PI client for PIMD, particularly regarding neighbor-list overhead at larger system sizes?
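To show how small that change is, here is a sketch of the modified loop. It assumes an in.lammps that loads the model via pair_style deepmd and connects to the server through LAMMPS's fix ipi, with an address/port matching the <ffsocket> in input.xml; input.xml itself is unchanged from the dp_ipi setup, and all file names are placeholders.

```bash
#!/bin/bash
# Sketch: same launch pattern as before, but with LAMMPS as the
# i-PI client instead of dp_ipi. in.lammps must contain a
# "fix ... ipi ..." line pointing at the same socket as the
# <ffsocket> in input.xml, plus "pair_style deepmd" for the forces.

i-pi input.xml > ipi.log 2>&1 &
sleep 10  # give the server time to open its socket

for gpu in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$gpu lmp -i in.lammps > lmp_${gpu}.log 2>&1 &
done

wait
```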
DeePMD-kit Version
3.1.3
Backend and its version
No response
Python Version, CUDA Version, GCC Version, LAMMPS Version, etc
No response
Details
No response
Reproducible Example, Input Files, and Commands
No response
Further Information, Files, and Links
No response